Strong Oracle Optimality of Folded Concave Penalized Estimation



Jianqing Fan, Lingzhou Xue and Hui Zou

Princeton University and University of Minnesota

This Version: October 20th, 2012



Abstract 



Folded concave penalization methods (Fan and Li, 2001) have been shown to enjoy the strong
oracle property for high-dimensional sparse estimation. However, a folded concave
penalization problem usually has multiple local solutions and the oracle property is
established only for one of the unknown local solutions. A challenging fundamental issue
still remains that it is not clear whether the local optimal solution computed by a given
optimization algorithm possesses those nice theoretical properties. To close this important
theoretical gap in over a decade, we provide a unified theory to show explicitly how to
obtain the oracle solution using the local linear approximation algorithm. For a folded
concave penalized estimation problem, we show that as long as the problem is localizable
and the oracle estimator is well behaved, we can obtain the oracle estimator by using the
one-step local linear approximation. In addition, once the oracle estimator is obtained,
the local linear approximation algorithm converges, namely produces the same estimator in
the next iteration. The general theory is demonstrated by using three classical sparse
estimation problems, i.e. the sparse linear regression, the sparse logistic regression and
the sparse precision matrix estimation, where the LASSO penalized least squares, the LASSO
penalized logistic regression and the CLIME are used as the initial estimator, respectively.

Key Words: Folded concave penalty; Local linear approximation; Non-convex optimization; 
Oracle estimator; Sparse estimation; Strong oracle property. 
1 Introduction



Sparse estimation is at the center of the stage of high-dimensional statistical learning. The
two mainstream methods are the LASSO (or $\ell_1$ penalization) and the folded concave
penalization (Fan and Li, 2001) such as the SCAD and the MCP. Numerous papers have been
devoted to the numerical and theoretical study of both methods. A strong irrepresentable
condition is necessary for the LASSO to be selection consistent (Zhao and Yu, 2006; Zou,
2006; Meinshausen and Bühlmann, 2006). The folded concave penalization, unlike the LASSO,
does not require the irrepresentable condition to achieve selection consistency and can
correct the intrinsic estimation bias of the LASSO penalization (Fan and Li, 2001; Fan and
Peng, 2004; Zhang, 2010a; Fan and Lv, 2011). The LASSO owes its popularity largely to its
computational properties. For certain learning problems, such as the LASSO penalized least
squares, the solution paths are piecewise linear, which allows one to employ a LARS-type
algorithm to compute the entire solution path efficiently (Efron et al., 2004). For a more
general class of LASSO penalized problems, the coordinate descent algorithm has been shown
to be very useful and efficient (Friedman et al., 2008, 2010).



The computation for folded concave penalized methods is much more involved, because
the resulting optimization problem is usually non-convex and has multiple local minimizers.
Several algorithms have been developed for computing the folded concave penalized
estimators. Fan and Li (2001) worked out the local quadratic approximation (LQA) algorithm
as a unified method for computing the folded concave penalized maximum likelihood. Zou and
Li (2008) proposed the local linear approximation (LLA) algorithm which turns a concave
penalized problem into a series of reweighted $\ell_1$ penalization problems. Both LQA and
LLA are related to the MM principle (Hunter and Lange, 2004; Hunter and Li, 2005). Zhang
(2010a) devised a PLUS algorithm for solving the penalized least squares using the MCP and
the SCAD. Recently, coordinate descent was applied to solve the folded concave penalized
least squares (Mazumder et al., 2011; Fan and Lv, 2011). With these advances in computing
algorithms, one can now at least efficiently compute a local solution of the folded concave
penalized problem. It has been shown repeatedly that the folded concave penalty performs
better than the LASSO in various high-dimensional sparse estimation problems. Examples
include sparse linear regression model estimation (Fan and Li, 2001; Zhang, 2010a), sparse
generalized linear model estimation (Fan and Lv, 2011), sparse Cox's proportional hazards
model estimation (Bradic et al., 2012), sparse precision matrix estimation (Lam and Fan,
2009) and sparse Ising model estimation (Xue et al., 2012), among others.

Before declaring that the folded concave penalization is superior to the LASSO, we still
need to find a missing piece of the puzzle. The optimal theoretical properties of the
folded concave penalization are established for a theoretic local solution. However, we have
to employ one of these local minimization algorithms to find such a local solution. It still
remains to prove that the computed local solution is the desired theoretic local solution
to make the theory fully relevant. Many have tried to address this issue (Zhang, 2010a; Fan
and Lv, 2011; Zhang and Zhang, 2012). The basic idea there is to find conditions under
which the folded concave penalized problem actually has a unique minimizer and hence
eliminate the problem of multiple local solutions. Although this line of thought is very
natural and logically intuitive, the imposed conditions for the unique minimizer are too
strong to be realistic.

In this paper we offer a very different and direct approach to deal with the multiple local
solutions issue. We outline a general procedure based on the LLA algorithm for computing
a specific local solution of the folded concave penalization problem and then derive a lower
bound on the probability that this specific computed solution exactly equals the oracle
solution. This probability lower bound equals $1 - \delta_0 - \delta_1 - \delta_2$, where
$\delta_0$ corresponds to the localizability of the underlying model, while $\delta_1$ and
$\delta_2$ represent the regularity of the oracle estimator and have nothing to do with any
actual estimation method. Explicit expressions of $\delta_0$, $\delta_1$ and $\delta_2$ are
given in Section 2. Under weak regularity conditions, $\delta_1$ and $\delta_2$ are very
small. Thus, if $\delta_0$ goes to zero then the computed LLA solution is the oracle
estimator with an overwhelming probability. On the other hand, if $\delta_0$ cannot go to
zero then it means that the underlying model is extremely difficult to estimate no matter
how clever an estimator is. Therefore, our theory suggests a
"bet-on-folded-concave-penalization" principle, since as long as there is a reasonable
estimator our procedure can deliver an optimal estimator using the folded concave
penalization via the one-step LLA implementation. Furthermore, we use concrete examples to
show how to prove that all tail probabilities $\delta_0$, $\delta_1$ and $\delta_2$ go to
zero at a fast rate under the ultra-high dimensional setting where $\log(p) = O(n^{\eta})$
for some $0 < \eta < 1$.

Throughout this paper the following useful notation will be used. For a matrix
$U = (u_{ij})$, denote by $\|U\|_{\min} = \min_{(i,j)} |u_{ij}|$ the minimum absolute value,
and denote by $\lambda_{\min}(U)$ and $\lambda_{\max}(U)$ the smallest and largest
eigenvalues of $U$, respectively. We also use several matrix norms: the $\ell_1$ norm
$\|U\|_{\ell_1} = \max_j \sum_i |u_{ij}|$, the $\ell_2$ norm
$\|U\|_{\ell_2} = \lambda_{\max}^{1/2}(U'U)$, the $\ell_\infty$ norm
$\|U\|_{\ell_\infty} = \max_i \sum_j |u_{ij}|$, the entry-wise $\ell_1$ norm
$\|U\|_1 = \sum_{(i,j)} |u_{ij}|$ and the entry-wise $\ell_\infty$ norm
$\|U\|_{\max} = \max_{(i,j)} |u_{ij}|$. For any symmetric matrix, its $\ell_1$ norm is equal
to its $\ell_\infty$ norm.
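As a concrete illustration of the notation above, the following minimal sketch computes the column-sum, row-sum and entry-wise norms for a small matrix. The function name `matrix_norms` and the list-of-rows representation are assumptions made for this example only; the $\ell_2$ norm is omitted because it needs an eigenvalue routine.

```python
def matrix_norms(U):
    """Norms of a matrix U given as a non-empty list of equal-length rows.

    Hypothetical helper for illustration; the keys mirror the notation in the text.
    """
    rows, cols = len(U), len(U[0])
    abs_U = [[abs(x) for x in row] for row in U]
    col_sums = [sum(abs_U[i][j] for i in range(rows)) for j in range(cols)]
    row_sums = [sum(row) for row in abs_U]
    return {
        "l1": max(col_sums),        # ||U||_{l1}: largest absolute column sum
        "linf": max(row_sums),      # ||U||_{l_inf}: largest absolute row sum
        "entry_l1": sum(row_sums),  # ||U||_1: sum of all |u_ij|
        "max": max(max(row) for row in abs_U),  # ||U||_max
        "min": min(min(row) for row in abs_U),  # ||U||_min
    }


# A symmetric example: its l1 and l_inf norms coincide, as noted in the text.
U = [[2.0, -1.0], [-1.0, 3.0]]
norms = matrix_norms(U)
```

For this symmetric `U`, the `"l1"` and `"linf"` entries agree, which matches the remark that the two norms coincide for symmetric matrices.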



2 Main Results 

We begin with a somewhat abstract/general presentation of the sparse estimation problem.
Consider estimating a model based on $n$ independent and identically distributed
$p$-dimensional observations, where the feature dimension $p$ is much larger than the sample
size $n$. The target of estimation is a $p$-dimensional "parameter"
$\beta^* = (\beta_1^*, \ldots, \beta_p^*)'$, that is, the underlying model is parameterized
by $\beta^*$. Note that in some problems the target of estimation $\beta^*$ can be a matrix
(e.g., a covariance matrix). In such cases it is understood that
$(\beta_1^*, \ldots, \beta_p^*)'$ is the vectorization of the matrix $\beta^*$. Denote its
corresponding support set by $A = \{j : \beta_j^* \neq 0\}$ with cardinality $s = |A|$. The
sparsity assumption means that $s \ll p$.

Suppose that our estimation scheme is to get a local minimizer of the penalized convex
loss function problem

$$\min_{\beta} \; \ell_n(\beta) + P_\lambda(|\beta|), \qquad (1)$$

where $\ell_n(\beta)$ represents the convex loss function and
$P_\lambda(|\beta|) = \sum_j P_\lambda(|\beta_j|)$ is a folded concave penalty function. The
above formulation is a bit abstract but covers many important statistical models and
estimators. For example, $\ell_n(\beta)$ can be the squared error loss in penalized least
squares and the negative log-quasi-likelihood function in penalized maximum
quasi-likelihood.

An oracle knows the true support set $A$ of the underlying model and the oracle estimator
is defined as

$$\hat\beta^{oracle} = (\hat\beta_A^{oracle}, 0) = \arg\min_{\beta:\, \beta_{A^c} = 0} \ell_n(\beta). \qquad (2)$$

We assume throughout the paper that the problem is regular so that the oracle solution is
unique, satisfying

$$\nabla_j \ell_n(\hat\beta^{oracle}) = 0, \quad \forall j \in A, \qquad (3)$$

where $\nabla_j$ is the partial derivative with respect to the $j$th component of $\beta$.
Note that the oracle estimator is not a feasible estimator but it can be used as a
theoretical benchmark for other estimators to compare with. An estimator is said to have
the oracle property if the estimator and the oracle estimator have the same asymptotic
distribution (Fan and Li, 2001; Fan and Peng, 2004). Moreover, an estimator is said to have
the strong oracle property if the estimator equals the oracle estimator with an
overwhelming probability (Kim et al., 2008; Fan and Lv, 2011).



Throughout this paper, we only need to assume that $\ell_n(\cdot)$ is a differentiable convex
function, and we also assume that $P_\lambda(|t|) = P_{\lambda,a}(|t|)$ is a general folded
concave penalty function defined on $t \in (-\infty, \infty)$ satisfying

(i) $P_\lambda(t)$ is increasing and concave in $t \in [0, \infty)$;

(ii) $P_\lambda(t)$ is differentiable in $t \in (0, \infty)$ with $P'_\lambda(0) := P'_\lambda(0+) \ge a_1\lambda$;

(iii) $P'_\lambda(t) \ge a_1\lambda$ for $t \in (0, a_2\lambda]$;

(iv) $P'_\lambda(t) = 0$ for $t \in [a\lambda, \infty)$ with the pre-specified constant $a > a_2$,

where $a_1$ and $a_2$ are some fixed positive constants. Note that the definition follows and
extends previous works on SCAD and MCP (Fan and Li, 2001; Zhang, 2010a; Fan and Lv, 2011).
The folded concave penalty was introduced to bridge the $\ell_1$ penalty and the $\ell_0$
penalty (Fan and Li, 2001). On the interval $[-a_2\lambda, a_2\lambda]$, the desired penalty
should penalize small coefficients as the $\ell_1$ penalty, and on the intervals outside the
interval $(-a\lambda, a\lambda)$, the penalty function should behave more like the $\ell_0$
penalty to avoid introducing biases. The above family of general folded concave penalties
includes several popular concave penalties proposed in recent years, for example the SCAD
(Fan and Li, 2001) whose derivative is given by

$$P'_\lambda(t) = \lambda I_{\{t \le \lambda\}} + \frac{(a\lambda - t)_+}{a - 1} I_{\{t > \lambda\}}, \quad \text{for some } a > 2,$$

and the MCP (Zhang, 2010a) whose derivative is given by

$$P'_\lambda(t) = \Big(\lambda - \frac{t}{a}\Big)_+, \quad \text{for some } a > 1.$$

By simple calculation, it is easy to see that $a_1 = a_2 = 1$ for the SCAD and
$a_1 = 1 - a^{-1}$, $a_2 = 1$ for the MCP.
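The SCAD and MCP derivatives, together with the constants $a_1$ and $a_2$ quoted for them, can be checked numerically. The sketch below is illustrative only: the function names and the default values of `a` (3.7 for SCAD, 3.0 for MCP) are choices made for this example.

```python
def scad_derivative(t, lam, a=3.7):
    """SCAD derivative P'_lambda(t) for t >= 0; requires a > 2.

    Equals lam on [0, lam] and (a*lam - t)_+ / (a - 1) beyond lam.
    """
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)


def mcp_derivative(t, lam, a=3.0):
    """MCP derivative P'_lambda(t) = (lam - t/a)_+ for t >= 0; requires a > 1."""
    return max(lam - t / a, 0.0)


# Condition (iii) with a_2 = 1: on (0, lam] the derivative stays above a_1 * lam,
# where a_1 = 1 for SCAD and a_1 = 1 - 1/a for MCP.
lam = 1.0
scad_at_lam = scad_derivative(lam, lam)        # equals 1 * lam
mcp_at_lam = mcp_derivative(lam, lam, a=2.0)   # equals (1 - 1/2) * lam

# Condition (iv): both derivatives vanish on [a*lam, infinity).
scad_tail = scad_derivative(3.7 * lam, lam)
mcp_tail = mcp_derivative(2.0 * lam, lam, a=2.0)
```

Both derivatives are flat at zero beyond $a\lambda$, which is what removes the estimation bias for large coefficients.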

Numerical results have been provided in the statistical literature to show that the folded
concave penalty performs much better than the $\ell_1$ penalty in terms of both model
estimation accuracy and variable selection consistency. To offer theoretical understanding
of their differences, it is important to show that the obtained local solution of the folded
concave penalized estimator has better theoretical properties than the LASSO estimator.
However, a general technical difficulty in the folded concave regularization problems is to
show that the computed local solution is the local solution with proven theoretical
properties. Under strong conditions, it has been argued that the folded concave penalized
problem has a unique minimizer and hence any algorithm finding a local solution will find
the global minimizer (Zhang, 2010a; Fan and Lv, 2011; Zhang and Zhang, 2012). The problem
with this argument is that in reality it is very rare that the folded concave penalized
problem actually has a unique minimizer, which in turn implies that these strong conditions
are too stringent to hold in practice. See the numerical results in Section 4.

We argue that, although the estimator is defined via a folded concave penalization problem,
we only care about the properties of the computed estimator. It is perfectly fine that the
computed local solution is not the global minimizer, as long as it has the optimal or desired
statistical properties. In this paper we directly analyze a specific solution by the local
linear approximation (LLA) algorithm (Zou and Li, 2008). The LLA algorithm takes advantage
of the special folded concave structure of penalty functions and utilizes the
majorization-minimization principle to turn a concave regularization problem into a sequence
of weighted $\ell_1$ penalization problems. Within each iteration of the LLA algorithm, the
underlying local linear approximation is actually the best convex majorization of the
concave penalty function (see Theorem 2 of Zou and Li (2008)). Moreover, the
majorization-minimization principle has provided theoretical justification to guarantee the
convergence of the LLA algorithm to a stationary point of the concave regularization
problem (1). The LLA convex relaxation idea has been used in Candes et al. (2008), Zhang
(2010b), Fan and Lv (2011), Bradic et al. (2012) and Huang and Zhang (2012). Here, we
summarize the details of the LLA algorithm as in Algorithm 1.



Algorithm 1 The LLA algorithm

1. Initialize $\hat\beta^{(0)} = \hat\beta^{initial}$ and compute the adaptive weight
   $\hat w^{(0)} = (\hat w_1^{(0)}, \ldots, \hat w_p^{(0)})' = (P'_\lambda(|\hat\beta_1^{(0)}|), \ldots, P'_\lambda(|\hat\beta_p^{(0)}|))'$.

2. For $m = 1, 2, \ldots$, repeat the LLA iteration till convergence
   (2.a) Obtain $\hat\beta^{(m)}$ by solving the following optimization problem
         $$\hat\beta^{(m)} = \arg\min_{\beta} \; \ell_n(\beta) + \sum_j \hat w_j^{(m-1)} |\beta_j|,$$
   (2.b) Update the adaptive weight vector $\hat w^{(m)}$ with $\hat w_j^{(m)} = P'_\lambda(|\hat\beta_j^{(m)}|)$.

Remark 1. Zhang (2010b) gave a high-dimensional analysis of the LLA algorithm in the
linear regression models, and Huang and Zhang (2012) further provided a detailed technical
analysis of the LLA algorithm in the high-dimensional generalized linear models. Huang and
Zhang (2012) required the convex loss function $\ell_n(\beta)$ to be twice differentiable,
and the theoretical results critically depend on the complex general invertibility factor,
cf. Definition 3 of Huang and Zhang (2012). In this work, we consider a more general folded
concave penalized convex loss problem without requiring $\ell_n(\beta)$ to be twice
differentiable, and we discuss how the LLA algorithm can actually find the oracle estimator
(2) with an overwhelming probability under fairly weak regularity conditions. In particular,
our high-dimensional analysis does not depend on the complex general invertibility factor as
in Huang and Zhang (2012).

In the following theorems, we provide the non-asymptotic analysis of the LLA algorithm for
obtaining the oracle estimator $\hat\beta^{oracle}$ in the folded concave penalized problem
if it is initiated by some initial estimator $\hat\beta^{initial}$. To simplify notation, we
define $\nabla\ell_n(\beta) = (\nabla_1\ell_n(\beta), \ldots, \nabla_p\ell_n(\beta))$ as the
gradient vector of $\ell_n(\beta)$. Moreover, denote by $A^c$ the complement of the true
support set $A$, i.e. $A^c = \{j : \beta_j^* = 0\}$, and set
$\nabla_{A^c}\ell_n(\beta) = (\nabla_j\ell_n(\beta) : j \in A^c)$ with respect to $A^c$.

Theorem 1. Suppose the minimal signal strength of $\beta^*$ satisfies that

(A0) $\|\beta_A^*\|_{\min} > (a+1)\lambda$.

Consider the folded concave penalized problem with $P_\lambda(\cdot)$ satisfying (i)-(iv).
Let $a_0 = \min\{1, a_2\}$. Under the event

$$\mathcal{E}_1 = \big\{\|\hat\beta^{initial} - \beta^*\|_{\max} \le a_0\lambda\big\} \cap \big\{\|\nabla_{A^c}\ell_n(\hat\beta^{oracle})\|_{\max} < a_1\lambda\big\},$$

the LLA algorithm initiated by $\hat\beta^{initial}$ finds the oracle estimator
$\hat\beta^{oracle}$ after one iteration.

Applying the union bound to $\mathcal{E}_1$, we easily get the following corollary.

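
Corollary 1. With a probability at least $1 - \delta_0 - \delta_1$, the LLA algorithm
initiated by $\hat\beta^{initial}$ finds the oracle estimator $\hat\beta^{oracle}$ after one
iteration, where

$$\delta_0 = \Pr\big(\|\hat\beta^{initial} - \beta^*\|_{\max} > a_0\lambda\big)$$

and

$$\delta_1 = \Pr\big(\|\nabla_{A^c}\ell_n(\hat\beta^{oracle})\|_{\max} \ge a_1\lambda\big).$$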
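
The union bound step behind Corollary 1 can be spelled out in one line. The display below is a short sketch that uses only the event $\mathcal{E}_1$ of Theorem 1 and the two tail probabilities $\delta_0$ and $\delta_1$ defined in the corollary.

```latex
\Pr(\mathcal{E}_1^c)
  \le \Pr\big(\|\hat\beta^{initial} - \beta^*\|_{\max} > a_0\lambda\big)
    + \Pr\big(\|\nabla_{A^c}\ell_n(\hat\beta^{oracle})\|_{\max} \ge a_1\lambda\big)
  = \delta_0 + \delta_1,
\qquad\text{hence}\qquad
\Pr(\mathcal{E}_1) \ge 1 - \delta_0 - \delta_1 .
```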

Remark 2. By its definition, $\delta_0$ represents the localizability of the underlying
model. To apply Theorem 1 we need to have an appropriate initial estimator to make
$\delta_0$ go to zero as $n$ and $p$ diverge to infinity, namely the underlying problem is
localizable. In Section 3 we will show by concrete examples how to find a good initial
estimator to make the problem localizable. $\delta_1$ represents the regularity behavior of
the oracle estimator, i.e., its closeness to the true "parameter" measured by the score
function. Note that $\nabla_{A^c}\ell_n(\beta^*)$ is concentrated around zero. Thus,
$\delta_1$ is usually small.

In summary, Theorem 1 and its corollary state that as long as the problem is localizable
and regular, we can find an oracle estimator by using one-step local linear approximation,
which can be regarded as the generalization of the LLA algorithm and the one-step estimation
idea (Zou and Li, 2008) to the high-dimensional setting.



Theorem 2. Consider the folded concave penalized problem with $P_\lambda(\cdot)$ satisfying
(i)-(iv). Under the event

$$\mathcal{E}_2 = \big\{\|\nabla_{A^c}\ell_n(\hat\beta^{oracle})\|_{\max} < a_1\lambda\big\} \cap \big\{\|\hat\beta_A^{oracle}\|_{\min} > a\lambda\big\},$$

as long as the LLA algorithm finds the oracle estimator $\hat\beta^{oracle}$, the LLA
algorithm will find $\hat\beta^{oracle}$ again in the next iteration, i.e. the LLA algorithm
converges to $\hat\beta^{oracle}$ in the next iteration.



Now we combine Theorems 1 and 2 to derive the non-asymptotic probability bound for the LLA
algorithm to exactly converge to the oracle estimator $\hat\beta^{oracle}$ in the general
folded concave penalized problem (1).

Corollary 2. Consider the folded concave penalized problem with $P_\lambda(\cdot)$ satisfying
(i)-(iv). Under the assumption of (A0), the LLA algorithm initiated by $\hat\beta^{initial}$
converges to the oracle estimator $\hat\beta^{oracle}$ after two iterations with a
probability at least $1 - \delta_0 - \delta_1 - \delta_2$, where

$$\delta_2 = \Pr\big(\|\hat\beta_A^{oracle}\|_{\min} \le a\lambda\big).$$


Remark 3. The localizable probability $1 - \delta_0$ and the regularity probability
$1 - \delta_1$ have been defined before. $\delta_2$ is a probability on the magnitude of the
oracle estimator. Both $\delta_1$ and $\delta_2$ are related to the regularity behavior of
the oracle estimator and will be referred to as the oracle regularity condition. Under the
minimum signal condition (A0), it requires only the uniform convergence of
$\hat\beta_A^{oracle}$. Namely,

$$\delta_2 \le \Pr\big(\|\hat\beta_A^{oracle} - \beta_A^*\|_{\max} > \lambda\big).$$

Thus we can regard $\delta_2$ as a direct measurement of the closeness of the oracle
estimator to the true "parameter", and it is usually small because of a small intrinsic
dimensionality $s$. This will indeed be shown in Section 3.
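
The bound on $\delta_2$ in Remark 3 follows from the triangle inequality and the minimal signal condition (A0). The short derivation below is a sketch of that step, written with the symbols already defined in Theorem 1 and Corollary 2.

```latex
\|\hat\beta_A^{oracle}\|_{\min}
  \ge \|\beta_A^*\|_{\min} - \|\hat\beta_A^{oracle} - \beta_A^*\|_{\max}
  > (a+1)\lambda - \|\hat\beta_A^{oracle} - \beta_A^*\|_{\max}.
% Hence \|\hat\beta_A^{oracle} - \beta_A^*\|_{\max} \le \lambda implies
% \|\hat\beta_A^{oracle}\|_{\min} > a\lambda, and by contraposition
\big\{\|\hat\beta_A^{oracle}\|_{\min} \le a\lambda\big\}
  \subseteq \big\{\|\hat\beta_A^{oracle} - \beta_A^*\|_{\max} > \lambda\big\}.
```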



3 Theoretical Examples 



In the sequel, we outline three classical examples to demonstrate interesting and powerful
applications of Theorems 1 and 2 to solve folded concave penalization problems. We need
basically to check the localizable condition and the regularity condition for these problems.

We focus specifically on the least-squares, logistic regression, and sparse precision matrix
estimation to derive a more explicit bound and to give cleaner results and proofs. For more
general cases in the family of the generalized linear models, the localizable condition
$\delta_0$ using LASSO can be verified by using the result of Fan and Lv (2011) and the
regularity conditions can be verified by using the concentration inequality of the maximum
likelihood estimator in Fan and Song (2011).
3.1 Sparse linear regression

The first example is the canonical problem of the folded concave penalized least square
estimation, i.e.

$$\min_{\beta} \; \frac{1}{2n}\|y - X\beta\|_{\ell_2}^2 + \sum_j P_\lambda(|\beta_j|), \qquad (4)$$

where $y \in \mathbb{R}^n$ and $X = (x_1, x_2, \ldots, x_n)' \in \mathbb{R}^{n \times p}$. Let
$\beta^*$ be the true parameter vector in the linear regression model
$y = X\beta^* + \varepsilon$, and then the true support set of
$\beta^* = (\beta_j^*)_{1 \le j \le p}$ is $A = \{j : \beta_j^* \neq 0\}$. For the folded
concave penalized least square problem, the oracle solution has an explicit form of
$\hat\beta_{LS}^{oracle} = (\hat\beta_A^{oracle}, 0)$ with

$$\hat\beta_A^{oracle} = (X_A'X_A)^{-1}X_A'y,$$

and the Hessian matrix is $n^{-1}X'X$ regardless of $\beta$. Applying Theorems 1 and 2 we can
derive the following theorem with explicit upper bounds for $\delta_1$ and $\delta_2$, which
depend only on the behavior of the oracle estimator.

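Theorem 3. Let $\delta_0^{LS} = \Pr\big(\|\hat\beta_{LS}^{initial} - \beta^*\|_{\max} > a_0\lambda\big)$. Suppose that

(A1) $y = X\beta^* + \varepsilon$ with $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$
being i.i.d. sub-Gaussian($\sigma$) for some fixed constant $\sigma > 0$, i.e.
$E[\exp(t\varepsilon_i)] \le \exp(\sigma^2t^2/2)$.

The LLA algorithm initiated by $\hat\beta_{LS}^{initial}$ converges to the oracle estimator
$\hat\beta_{LS}^{oracle}$ after two iterations with a probability at least
$1 - \delta_0^{LS} - \delta_1^{LS} - \delta_2^{LS}$, where

$$\delta_1^{LS} = 2(p - s) \cdot \exp\Big(-\frac{a_1^2 n\lambda^2}{2M\sigma^2}\Big)$$

and

$$\delta_2^{LS} = 2s \cdot \exp\Big(-\frac{n\lambda_{\min}}{2\sigma^2}\big(\|\beta_A^*\|_{\min} - a\lambda\big)^2\Big),$$

where $\lambda_{\min} = \lambda_{\min}\big(\frac{1}{n}X_A'X_A\big)$ and
$M = \max_j \frac{1}{n}\|x_{(j)}\|_{\ell_2}^2$, which is usually 1 due to normalization, with
$x_{(j)} = (x_{1j}, \ldots, x_{nj})'$.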
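
To get a feel for how fast the two oracle-regularity tail bounds of Theorem 3 decay, they can be evaluated numerically. The sketch below is illustrative: the function names and all parameter values are made up for the example, and the expression used for $\delta_1^{LS}$ is the sub-Gaussian union bound $2(p-s)\exp(-a_1^2 n\lambda^2/(2M\sigma^2))$, which is assumed here rather than taken as given.

```python
import math


def delta1_ls(n, p, s, lam, a1, sigma, M=1.0):
    """Assumed form of the bound on delta_1^{LS}: 2(p-s) exp(-a1^2 n lam^2 / (2 M sigma^2))."""
    return 2.0 * (p - s) * math.exp(-(a1 ** 2) * n * lam ** 2 / (2.0 * M * sigma ** 2))


def delta2_ls(n, s, lam, a, sigma, lam_min, beta_min):
    """Bound on delta_2^{LS}: 2s exp(-(n lam_min / (2 sigma^2)) (beta_min - a lam)^2)."""
    gap = beta_min - a * lam
    return 2.0 * s * math.exp(-n * lam_min * gap ** 2 / (2.0 * sigma ** 2))


# Hypothetical setting: p = 1000 features, s = 5 signals, SCAD constants a = 3.7, a1 = 1,
# and a minimal signal slightly above (a + 1) * lam as required by (A0).
n, p, s, lam, a, sigma = 200, 1000, 5, 0.5, 3.7, 1.0
d1 = delta1_ls(n, p, s, lam, a1=1.0, sigma=sigma)
d2 = delta2_ls(n, s, lam, a, sigma, lam_min=0.5, beta_min=(a + 1.0) * lam + 0.1)
```

Both quantities are tiny in this setting, and both shrink exponentially as `n` grows with the other inputs held fixed.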

By Theorem 3 both tail probabilities $\delta_1^{LS}$ and $\delta_2^{LS}$ go to zero very
quickly. Then it remains to bound $\delta_0^{LS}$. To analyze $\delta_0^{LS}$ we should
decide the initial estimator. Here we consider the LASSO estimator (Tibshirani, 1996) as a
natural choice to initialize the LLA algorithm, where the LASSO estimator is defined by

$$\hat\beta_{LS}^{lasso} = \arg\min_{\beta} \; \frac{1}{2n}\|y - X\beta\|_{\ell_2}^2 + \lambda_{lasso}\|\beta\|_{\ell_1}. \qquad (5)$$
Note that LASSO corresponds to the LLA estimator with initial estimator
$\hat\beta^{initial} = 0$. In order to derive the estimation bound on
$\hat\beta_{LS}^{lasso} - \beta^*$, we invoke the following restricted eigenvalue condition,

(C1) $\kappa_{LS} = \min_{u \neq 0:\, \|u_{A^c}\|_{\ell_1} \le 3\|u_A\|_{\ell_1}} \dfrac{\|Xu\|_{\ell_2}^2}{n\|u\|_{\ell_2}^2} \in (0, \infty)$.

Such a condition has been studied in Bickel et al. (2009), van de Geer and Bühlmann (2009),
Raskutti et al. (2010) and Negahban et al. (2012). Under the sub-Gaussian noise assumption
(A1) and also the restricted eigenvalue condition (C1), the LASSO estimator can yield a
unique optimal solution $\hat\beta_{LS}^{lasso}$ such that

$$\|\hat\beta_{LS}^{lasso} - \beta^*\|_{\ell_2} \le 2\kappa_{LS}^{-1}s^{1/2}\lambda_{lasso}$$

with probability at least $1 - c_2\exp(-c_1 n\lambda_{lasso}^2)$, where $c_1$ and $c_2$ are
two fixed positive constants. See Corollary 2 of Negahban et al. (2012) for more details.
Thus, using this as an upper bound for $\|\hat\beta_{LS}^{lasso} - \beta^*\|_{\max}$, it is
easy for us to obtain the following corollary.



Corollary 3. Under the assumptions of (A0), (A1) and (C1), as long as $\lambda$ is chosen to
be greater than $2(a_0\kappa_{LS})^{-1}s^{1/2}\lambda_{lasso}$, the LLA algorithm initiated by
$\hat\beta_{LS}^{lasso}$ converges to the oracle estimator $\hat\beta_{LS}^{oracle}$ after
two iterations with a probability at least
$1 - c_2\exp(-c_1 n\lambda_{lasso}^2) - \delta_1^{LS} - \delta_2^{LS}$, where
$\delta_1^{LS}$ and $\delta_2^{LS}$ are given in Theorem 3.

Remark 4. Before concluding this example we would like to emphasize that Theorem 3
is independent of the initial estimator. Although we have considered using the LASSO
penalized least squares estimator as the initial estimator, we can also use the Dantzig
selector (Candes and Tao, 2007) as the initial estimator and the same analysis can go
through under condition (C1) (Bickel et al., 2009).



3.2 Sparse logistic regression 

The second example is the folded concave penalized logistic regression. Assume that

(A2) the conditional distribution of $y_i$ given $x_i$ ($i = 1, 2, \ldots, n$) is a Bernoulli
distribution with $\Pr(y_i = 1 \mid x_i, \beta^*) = \exp(x_i'\beta^*)/(1 + \exp(x_i'\beta^*))$.
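Then, the penalized logistic regression is given by

$$\min_{\beta} \; \frac{1}{n}\sum_i \big\{-y_i x_i'\beta + \psi(x_i'\beta)\big\} + \sum_j P_\lambda(|\beta_j|), \qquad (6)$$

where $\psi(t) = \log(1 + \exp(t))$ is the canonical link function. This model is the
canonical statistical model for high-dimensional binary classification problems, and it is a
classical example of the generalized linear model.
The oracle estimator is given by

$$\hat\beta_{Logit}^{oracle} = (\hat\beta_A^{oracle}, 0) = \arg\min_{\beta:\, \beta_{A^c} = 0} \frac{1}{n}\sum_i \big\{-y_i x_i'\beta + \psi(x_i'\beta)\big\}.$$

For ease of presentation, we define

$$\mu(\beta) = \big(\psi'(x_1'\beta), \ldots, \psi'(x_n'\beta)\big)'$$

and

$$\Sigma(\beta) = \mathrm{diag}\big\{\psi''(x_1'\beta), \ldots, \psi''(x_n'\beta)\big\}.$$

In addition, we introduce the following three useful quantities:

$$Q_1 = \max_j \lambda_{\max}\Big(\frac{1}{n}X_A'\,\mathrm{diag}\{|x_{(j)}|\}\,X_A\Big), \quad
Q_2 = \Big\|\Big(\frac{1}{n}X_A'\Sigma(\beta^*)X_A\Big)^{-1}\Big\|_{\ell_\infty}, \quad
Q_3 = \big\|X_{A^c}'\Sigma(\beta^*)X_A\big(X_A'\Sigma(\beta^*)X_A\big)^{-1}\big\|_{\ell_\infty},$$

in which $\mathrm{diag}\{|x_{(j)}|\}$ is a diagonal matrix with elements $\{|x_{ij}|\}_{i=1}^n$.
We first derive bounds on $\delta_1$ and $\delta_2$.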
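
The ingredients of the logistic example, namely $\psi$, its first two derivatives, the unpenalized loss in (6) and its gradient, are simple to compute. The following sketch is an illustration with hypothetical function names; it evaluates $\psi$ and $\psi'$ in a numerically stable way and represents the design matrix as a list of rows.

```python
import math


def psi(t):
    """psi(t) = log(1 + exp(t)), arranged to avoid overflow for large |t|."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))


def psi_prime(t):
    """psi'(t) = exp(t) / (1 + exp(t)), the success probability under (A2)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def psi_double_prime(t):
    """psi''(t) = psi'(t) * (1 - psi'(t)), the diagonal entries of Sigma(beta)."""
    m = psi_prime(t)
    return m * (1.0 - m)


def logistic_loss(X, y, beta):
    """Unpenalized loss (1/n) sum_i { -y_i x_i'beta + psi(x_i'beta) }."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(a * b for a, b in zip(xi, beta))
        total += -yi * eta + psi(eta)
    return total / len(X)


def logistic_gradient(X, y, beta):
    """Gradient (1/n) sum_i (psi'(x_i'beta) - y_i) x_i of the loss above."""
    n = len(X)
    grad = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        eta = sum(a * b for a, b in zip(xi, beta))
        r = psi_prime(eta) - yi
        for j, xij in enumerate(xi):
            grad[j] += r * xij / n
    return grad


# Two observations, evaluated at beta = 0 where every fitted probability is 1/2.
X = [[1.0, 2.0], [1.0, -1.0]]
y = [1, 0]
loss0 = logistic_loss(X, y, [0.0, 0.0])
grad0 = logistic_gradient(X, y, [0.0, 0.0])
```

The maximum absolute value of the entries of such a gradient restricted to $A^c$, evaluated at the oracle estimator, is the quantity that $\delta_1$ controls.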

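
Theorem 4. Let $\delta_0^{Logit} = \Pr\big(\|\hat\beta_{Logit}^{initial} - \beta^*\|_{\max} > a_0\lambda\big)$.
Under Assumption (A2), the LLA algorithm initiated by $\hat\beta_{Logit}^{initial}$ converges
to the oracle estimator $\hat\beta_{Logit}^{oracle}$ after two iterations with a probability
at least $1 - \delta_0^{Logit} - \delta_1^{Logit} - \delta_2^{Logit}$, where

$$\delta_1^{Logit} = 2s \cdot \exp\Big(-\frac{n}{M}\min\Big\{\frac{2}{Q_1^2Q_2^4s^2}, \frac{a_1^2\lambda^2}{2(1 + 2Q_3)^2}\Big\}\Big) + 2(p - s) \cdot \exp\Big(-\frac{a_1^2 n\lambda^2}{2M}\Big),$$

where $M = \max_j n^{-1}\|x_{(j)}\|_{\ell_2}^2$, and

$$\delta_2^{Logit} = 2s \cdot \exp\Big(-\frac{n}{MQ_2^2}\min\Big\{\frac{2}{Q_1^2Q_2^2s^2}, \frac{1}{2}\big(\|\beta_A^*\|_{\min} - a\lambda\big)^2\Big\}\Big).$$
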
Under fairly weak assumptions, both $\delta_1^{Logit}$ and $\delta_2^{Logit}$ go to zero very
quickly. The remaining challenge is to bound $\delta_0^{Logit}$. We consider using the
$\ell_1$-penalized maximum likelihood estimator as the initial estimator, i.e.

$$\hat\beta_{Logit}^{lasso} = \arg\min_{\beta} \; \frac{1}{n}\sum_i \big\{-y_i x_i'\beta + \psi(x_i'\beta)\big\} + \lambda_{lasso}\|\beta\|_{\ell_1}.$$

In the following theorem, we provide the estimation bound on $\hat\beta_{Logit}^{lasso} - \beta^*$.

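Theorem 5. Let $m = \max_{(i,j)} |x_{ij}|$. Suppose that

(C2) $\kappa_{Logit} = \min_{u \neq 0:\, \|u_{A^c}\|_{\ell_1} \le 3\|u_A\|_{\ell_1}} \dfrac{u'\nabla^2\ell_n(\beta^*)u}{u'u} \in (0, \infty)$,
where $\ell_n$ denotes the logistic loss in (6).

Then the LASSO estimator $\hat\beta_{Logit}^{lasso}$ with
$\lambda_{lasso} \le \kappa_{Logit}/(20ms)$ satisfies

$$\|\hat\beta_{Logit}^{lasso} - \beta^*\|_{\ell_2} \le 5\kappa_{Logit}^{-1}s^{1/2}\lambda_{lasso}$$

with a probability at least $1 - 2p \cdot \exp\big(-\frac{n}{2M}\lambda_{lasso}^2\big)$.
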

In light of Theorem 5 we can obtain the following corollary.

Corollary 4. Under the assumptions of (A0), (A2) and (C2), as long as $\lambda$ is chosen to
be greater than $5(a_0\kappa_{Logit})^{-1}s^{1/2}\lambda_{lasso}$, the LLA algorithm initiated
by $\hat\beta_{Logit}^{lasso}$ converges to the oracle estimator $\hat\beta_{Logit}^{oracle}$
after two iterations with a probability at least
$1 - 2p\exp\big(-\frac{n}{2M}\lambda_{lasso}^2\big) - \delta_1^{Logit} - \delta_2^{Logit}$,
where $\delta_1^{Logit}$ and $\delta_2^{Logit}$ are given in Theorem 4.

3.3 Sparse precision matrix estimation 

The third example is the folded concave penalized Gaussian quasi-likelihood estimator for
the sparse precision matrix estimation problem, i.e.

$$\min_{\Theta \succ 0} \; -\log\det(\Theta) + \langle\Theta, \hat\Sigma_n\rangle + \sum_{(j,k):\, j \neq k} P_\lambda(|\theta_{jk}|), \qquad (7)$$

where $\hat\Sigma_n = (\hat\sigma_{ij}^n)_{q \times q}$ is the sample covariance matrix
estimator. In particular, under the assumption of the Gaussian distribution, the sparse
precision matrix can be interpreted as a sparse Gaussian graphical model (Meinshausen and
Bühlmann, 2006; Yuan and Lin, 2007; Lam and Fan, 2009). In this example the target
"parameter" $\beta^*$ is the true precision matrix $\Theta^* = (\theta_{jk}^*)_{q \times q}$
and the corresponding support set is $A = \{(j,k) : \theta_{jk}^* \neq 0\}$. Due to the
symmetric structure of $\Theta^*$, the dimension is $p = q(q+1)/2$ and the cardinality of $A$
is $s = \#\{(j,k) : j \le k \ \&\ \theta_{jk}^* \neq 0\}$.

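Now we introduce the oracle precision matrix estimator as follows,

$$\hat\Theta_G^{oracle} = \arg\min_{\Theta \succ 0} \; -\log\det(\Theta) + \langle\Theta, \hat\Sigma_n\rangle \quad \text{subject to } \theta_{jk} = 0, \ \forall (j,k) \in A^c.$$

Then we can write
$\hat\Theta_G^{oracle} = (\hat\Theta_A^{oracle}, \hat\Theta_{A^c}^{oracle}) = (\hat\Theta_A^{oracle}, 0)$.
For ease of notation, we define $\hat\Sigma_G^{oracle} = (\hat\Theta_G^{oracle})^{-1}$.
Similarly, we partition $\hat\Sigma_n$ and $\hat\Sigma_G^{oracle}$ in terms of $A$, i.e.
$\hat\Sigma_n = (\hat\Sigma_A^n, \hat\Sigma_{A^c}^n)$ and
$\hat\Sigma_G^{oracle} = (\hat\Sigma_A^{oracle}, \hat\Sigma_{A^c}^{oracle})$. Note that the
Hessian matrix of the negative log-quasi-likelihood function has the explicit expression
$H^* = (\Theta^*)^{-1} \otimes (\Theta^*)^{-1}$. We also define

$$K_1 = \|\Sigma^*\|_{\ell_\infty}, \quad K_2 = \|(H_{AA}^*)^{-1}\|_{\ell_\infty} \quad \text{and} \quad K_3 = \|H_{A^cA}^*(H_{AA}^*)^{-1}\|_{\ell_\infty}.$$

We also define the maximal degree as $d = \max_j \#\{k : \theta_{jk}^* \neq 0\}$.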

In the next theorem, we derive explicit bounds for $\delta_1$ and $\delta_2$. For space
consideration, we only consider the Gaussian distribution. Indeed, we can obtain exactly the
same convergence result of the LLA algorithm for the folded concave penalized Gaussian
quasi-likelihood problem under the exponential tail or the polynomial tail condition as in
Cai et al. (2011). To bound $\delta_1$ and $\delta_2$ under the Gaussian assumption, we cite
a large deviation result by Saulis and Statulevičius (1991) and Bickel and Levina (2008): for
any $\nu$ such that $|\nu| \le \delta$,

$$\Pr\big(|\hat\sigma_{ij}^n - \sigma_{ij}^*| > \nu\big) \le C_0\exp(-c_0 n\nu^2), \qquad (8)$$

where $\delta$, $c_0$ and $C_0$ depend on $\max_j \sigma_{jj}^*$ only.



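Theorem 6. Let $\delta_0^G = \Pr\big(\|\hat\Theta_G^{initial} - \Theta^*\|_{\max} > a_0\lambda\big)$. Suppose that

(A0') $\|\Theta_A^*\|_{\min} > (a+1)\lambda$,

and we further assume that

(A3) $X = (x_1, \ldots, x_n)'$ are i.i.d. Gaussian random samples with the covariance matrix $\Sigma^*$.

The LLA algorithm initiated by $\hat\Theta_G^{initial}$ converges to the oracle estimator
$\hat\Theta_G^{oracle}$ after two iterations with a probability at least
$1 - \delta_0^G - \delta_1^G - \delta_2^G$, where

$$\delta_1^G = C_0 s \cdot \exp\Big(-\frac{c_0}{4}n \cdot \min\Big\{\frac{a_1^2\lambda^2}{(2K_3 + 1)^2}, \frac{1}{9K_1^2K_2^2d^2}, \frac{1}{9K_1^6K_2^4d^2}\Big\}\Big) + C_0(p - s) \cdot \exp\Big(-\frac{c_0a_1^2}{4}n\lambda^2\Big)$$

and

$$\delta_2^G = C_0 s \cdot \exp\Big(-\frac{c_0 n}{4K_2^2} \cdot \min\Big\{\frac{1}{9K_1^2d^2}, \frac{1}{9K_1^6K_2^2d^2}, \big(\|\Theta_A^*\|_{\min} - a\lambda\big)^2\Big\}\Big).$$
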
Theorem 6 tells us that both $\delta_1^G$ and $\delta_2^G$ go to zero very quickly. Now we
only need to deal with $\delta_0^G$. To initialize the LLA algorithm, we consider using the
constrained $\ell_1$ minimization estimator (CLIME) by Cai et al. (2011), i.e.

$$\hat\Theta_G^{clime} = \arg\min_{\Theta} \|\Theta\|_1 \quad \text{subject to } \|\hat\Sigma_n\Theta - I\|_{\max} \le \lambda_{clime}.$$

To obtain the convergence rate of $\hat\Theta_G^{clime}$, we write
$\|\Theta^*\|_{\ell_1} = L$. As discussed in Cai et al. (2011), it is reasonable to assume
that $L$ is upper bounded by a constant or $L$ is some slowly diverging quantity, because
$\Theta^*$ has a few nonzero entries in each row. We combine the concentration bound (8) and
the same line of proof as in Cai et al. (2011) to show that

$$\|\hat\Theta_G^{clime} - \Theta^*\|_{\max} \le 4L\lambda_{clime}$$

with a probability at least $1 - C_0 p \cdot \exp\big(-\frac{c_0}{L^2}n\lambda_{clime}^2\big)$.

Thus we have the following corollary.

Corollary 5. Under the assumptions of (A0') and (A3), as long as $\lambda$ is chosen to be
greater than $4a_0^{-1}L\lambda_{clime}$, the LLA algorithm initiated by
$\hat\Theta_G^{clime}$ converges to the oracle estimator $\hat\Theta_G^{oracle}$ after two
iterations with a probability at least
$1 - C_0 p \cdot \exp\big(-\frac{c_0}{L^2}n\lambda_{clime}^2\big) - \delta_1^G - \delta_2^G$.




3.4 Comments on the Irrepresentable Condition 

So far we have demonstrated the applications of Theorems 1-2 for the LLA algorithm on
three classical sparse estimation problems. It is well-known that the irrepresentable
condition is necessary for the $\ell_1$ penalization method to have the selection consistency
property. Here we list the corresponding irrepresentable condition for the $\ell_1$ penalized
least squares, the $\ell_1$ penalized logistic regression and the $\ell_1$ penalized
precision matrix estimation (Zhao and Yu, 2006; Ravikumar et al., 2008; Wainwright, 2009;
Ravikumar et al., 2010):

(C3) the $\ell_1$ penalized least squares: there exists some positive constant
$\gamma_{LS} \in (0, 1)$ such that

$$\|X_{A^c}'X_A(X_A'X_A)^{-1}\|_{\ell_\infty} \le 1 - \gamma_{LS};$$

(C4) the $\ell_1$ penalized logistic regression: there exists some positive constant
$\gamma_{Logit} \in (0, 1)$ such that

$$\|X_{A^c}'\Sigma(\beta^*)X_A(X_A'\Sigma(\beta^*)X_A)^{-1}\|_{\ell_\infty} \le 1 - \gamma_{Logit};$$

(C5) the $\ell_1$ penalized precision matrix estimation: there exists some positive constant
$\gamma_G \in (0, 1)$ such that

$$\|H_{A^cA}^*(H_{AA}^*)^{-1}\|_{\ell_\infty} \le 1 - \gamma_G.$$



All these conditions have been argued to be too restrictive, for example, see Zou (2006); Zhang (2010a); Fan and Lv (2011); Cai et al. (2011) and Xue et al. (2012). From our analysis it is clear that our theory does not require the initial estimator to be selection consistent.
Thus we do not need to use these irrepresentable conditions even when the $\ell_1$ penalized estimator is used as the initial estimator. This message is the most interesting in the case of sparse precision matrix estimation. We propose to use the CLIME as the initial estimator in the LLA algorithm. The reason is that a nice bound can be established for the CLIME under the elementwise maximum norm, which is exactly what we need in order to apply Theorems 1 and 2. It is also interesting to see that although the sparse precision matrix estimation is the most complicated one among the three examples, it actually requires the weakest regularity conditions to apply Theorems 1 and 2. We have used the restricted eigenvalue conditions (C1) and (C2) for sparse least squares regression and sparse logistic regression. Based on the current literature, it seems very difficult, if not impossible, to greatly relax (C1) and (C2) while keeping a nice bound on the estimation accuracy of the $\ell_1$ penalized least squares/logistic regression estimator under the $\ell_2$ loss. According to Bickel et al. (2009), the restricted eigenvalue condition (C1) is also used to derive estimation bounds for the Dantzig selector. Hence this condition is still needed if we use the Dantzig selector instead of the LASSO as the initial estimator. In contrast, in the sparse precision matrix estimation problem we do not need to impose any structure assumption on $\Sigma^*$ or the Hessian matrix $H^* = \Sigma^* \otimes \Sigma^*$. The condition on $\|\Theta^*_{\mathcal{A}}\|_{\min}$ is not strong under the strong sparsity assumption on $\Theta^*$. From this perspective, the sparse precision matrix estimation example is the best among the three to demonstrate the power and application of Theorems 1 and 2.



4 Simulation Studies 

In this section we use simulation to examine the finite sample properties of the folded concave penalized estimation for solving three classical problems, i.e., sparse linear regression, sparse logistic regression and sparse precision matrix estimation. We use several different local solution algorithms to compute SCAD/MCP penalized estimators. We fix $a = 3.7$ in the SCAD and $a = 2$ in the MCP as suggested in Fan and Li (2001) and Zhang (2010a), respectively. We also include the LASSO penalized estimator in the study.
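As a minimal sketch of the two penalties compared in this section, the derivative $P'_\lambda(t)$, which supplies the weights in each LLA step, can be written as below. The closed forms are the standard SCAD and MCP derivatives; the function names and the default values of `a` (taken from the values fixed above) are illustrative choices, not code from the paper.

```python
def scad_derivative(t, lam, a=3.7):
    """SCAD derivative P'_lam(|t|): equal to lam on [0, lam],
    decaying linearly on (lam, a*lam], and 0 beyond a*lam."""
    t = abs(t)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)


def mcp_derivative(t, lam, a=2.0):
    """MCP derivative P'_lam(|t|) = (lam - |t| / a)_+."""
    return max(lam - abs(t) / a, 0.0)


if __name__ == "__main__":
    lam = 0.5  # illustrative penalization level
    for t in (0.0, 0.25, 0.5, 1.0, 1.85, 3.0):
        print(t, scad_derivative(t, lam), mcp_derivative(t, lam))
```

Both derivatives equal $\lambda$ at zero and vanish once $|t| \ge a\lambda$, which is the flat-tail behaviour that lets large coefficients escape penalization in the LLA step.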



4.1 Sparse linear regression and logistic regression models 

First we simulated the independent random samples $(x_1, y_1), \ldots, (x_n, y_n)$ from the following four sparse linear regression and logistic regression models.

Models 1 and 2 are sparse linear models.

Model 1: $y = x'\beta^* + \varepsilon$ where $\beta^* = (3, 1.5, 0, 0, 2, 0_{p-5})$, $\varepsilon \sim N(0, 1)$ and $x \sim N_p(0, \Sigma)$ with $\Sigma = (0.5^{|i-j|})_{p \times p}$.

Model 2: The setup is the same as in Model 1, except that $\beta^*$ is constructed by randomly choosing 10 elements in $\beta^*$ as independent Bernoulli random samples with equal probability to be 1 or $-1$, and setting the other $p - 10$ elements as zeros.

We let $n = 100$ and $p = 500$ & $1000$. We also generated an independent validation set of sample size 100 to tune each estimator. The validation error of a generic estimator $\hat\beta$ is defined as $\sum_{i \in \text{validation}} (y_i - x_i'\hat\beta)^2$.

Models 3 and 4 are sparse logistic regression models.

Model 3: $y$ follows a Bernoulli distribution with the probability of success being $\exp(x'\beta^*)/(1 + \exp(x'\beta^*))$, where $\beta^* = (3, 1.5, 0, 0, 2, 0_{p-5})$ and $x \sim N_p(0, \Sigma)$ with $\Sigma = (0.5^{|i-j|})_{p \times p}$.

Model 4: The setup is the same as in Model 3, except that $\beta^*$ is constructed by randomly choosing 10 elements in $\beta^*$ as $t_1 s_1, \ldots, t_{10} s_{10}$ and setting the other $p - 10$ elements as zeros, where the $t_j$'s are independently drawn from Unif(1, 2), and the $s_j$'s are independent Bernoulli random samples with $\Pr(s_j = 1) = \Pr(s_j = -1) = 0.5$.

We let $n = 200$ and $p = 500$ & $1000$. We also generated an independent validation set of sample size 200 to tune each estimator. The validation error of a generic estimator $\hat\beta$ is defined as

$$\sum_{i \in \text{validation}} \left(-y_i x_i'\hat\beta + \log(1 + \exp(x_i'\hat\beta))\right).$$
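The data-generating mechanisms of Models 1 and 3 can be sketched in a few lines. The sketch below assumes the covariance $\Sigma_{ij} = 0.5^{|i-j|}$ and draws each row by an AR(1) recursion, which reproduces that covariance without forming $\Sigma$; the function names and the `rng` argument are hypothetical helpers introduced only for illustration.

```python
import math
import random


def simulate_design(n, p, rho=0.5, rng=random):
    """Draw n rows of x ~ N_p(0, Sigma) with Sigma_ij = rho**|i-j|,
    using the recursion x_j = rho * x_{j-1} + sqrt(1 - rho^2) * z_j."""
    scale = math.sqrt(1.0 - rho * rho)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(p - 1):
            x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
        rows.append(x)
    return rows


def simulate_linear(n, p, beta, rng=random):
    """Model 1 style responses: y = x'beta + eps with eps ~ N(0, 1)."""
    X = simulate_design(n, p, rng=rng)
    y = [sum(b * v for b, v in zip(beta, row)) + rng.gauss(0.0, 1.0) for row in X]
    return X, y


def simulate_logistic(n, p, beta, rng=random):
    """Model 3 style responses: y ~ Bernoulli(exp(x'beta) / (1 + exp(x'beta)))."""
    X = simulate_design(n, p, rng=rng)
    y = []
    for row in X:
        eta = sum(b * v for b, v in zip(beta, row))
        prob = 1.0 / (1.0 + math.exp(-eta))
        y.append(1 if rng.random() < prob else 0)
    return X, y


if __name__ == "__main__":
    p = 500
    beta_star = [3.0, 1.5, 0.0, 0.0, 2.0] + [0.0] * (p - 5)
    X, y = simulate_linear(100, p, beta_star, rng=random.Random(2012))
    print(len(X), len(X[0]), len(y))
```

Calling the same function a second time with a fresh sample gives the independent validation set used for tuning.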

We computed the LASSO penalized linear/logistic regression by the popular R package glmnet (Friedman et al. 2012) and chose its penalization parameter by minimizing the validation error. We implemented three local solutions of SCAD/MCP. The first local solution was computed by using coordinate descent. We denote it by SCAD-cd/MCP-cd. The second local solution, denoted by SCAD-lla0/MCP-lla0, was computed by the LLA algorithm with zeros as its initial estimator. The third local solution, denoted by SCAD-lla*/MCP-lla*, was computed by the LLA algorithm with the tuned LASSO estimator as its initial estimator. SCAD-lla*/MCP-lla* was designed according to the theoretical analysis in Sections 3.1 and 3.2. Given an initial estimator, we implemented the LLA algorithm for SCAD/MCP by using glmnet to solve the weighted $\ell_1$ penalized estimator at each LLA step. For each local solution of SCAD/MCP, its penalization parameter was chosen by minimizing the validation error.
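The LLA iteration described above can be illustrated with a small self-contained sketch for the least-squares loss. This is not the authors' implementation: the paper solves each weighted $\ell_1$ problem with glmnet, whereas the sketch substitutes a plain coordinate-descent solver, and all function names are hypothetical. Each LLA step recomputes the weights $P'_\lambda(|\hat\beta_j|)$ from the current estimate and solves a weighted LASSO.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def scad_derivative(t, lam, a=3.7):
    """SCAD derivative P'_lam(|t|), used as the LLA weight."""
    t = abs(t)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)


def weighted_lasso(X, y, weights, n_sweeps=500, tol=1e-10):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + sum_j weights[j] * |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, weights[j]) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def lla_scad(X, y, beta_init, lam, n_steps=2):
    """Run n_steps LLA iterations for the SCAD penalty from beta_init."""
    beta = list(beta_init)
    for _ in range(n_steps):
        weights = [scad_derivative(b, lam) for b in beta]
        beta = weighted_lasso(X, y, weights)
    return beta


if __name__ == "__main__":
    # Toy orthogonal design (X'X / n = I) with true support {0, 2}.
    X = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0],
         [0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
    y = [6.2, 0.2, 4.8, -0.2]
    lam = 0.5
    beta_lasso = weighted_lasso(X, y, [lam] * 4)  # LASSO initial estimator
    print(beta_lasso, lla_scad(X, y, beta_lasso, lam))
```

On this toy design the LASSO initial estimator shrinks both signals by $\lambda$, while the LLA iterate assigns them zero weight and returns the least-squares fit on the true support, i.e. the oracle estimator; a second step leaves it unchanged.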



Tables 1-4 are about here.



For each model, we generated 100 independent datasets, each consisting of $n$ training samples and $n$ validation samples. Estimation accuracy is measured by the average $\ell_1$ loss $\|\hat\beta - \beta^*\|_{\ell_1}$ over the 100 replications, and selection accuracy is evaluated by the average counts of false positives and false negatives over the 100 replications. The simulation results of Models 1-4 are summarized in Tables 1-4, respectively. Needless to say, all SCAD/MCP solutions perform much better than the LASSO estimator. This is a familiar message from previous works on folded concave penalized estimation (Fan and Li 2001; Zhang 2010a; Fan and Lv 2011). We would like to emphasize the comparison between the local solutions of SCAD/MCP. First, it is very interesting to see that the local solutions of SCAD/MCP are very different. Even the two LLA local solutions are noticeably different. This clearly suggests that the unique minimizer argument does not apply here. Second, SCAD-lla* and MCP-lla* achieve the best performance in both estimation and selection, which gives numerical evidence for the theoretical analysis in Sections 3.1 and 3.2. When the average FP and FN are zero, the estimator is model selection consistent, which is also evidence of finding the oracle estimator.
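The accuracy measures reported in the tables are simple to compute for a single replication. The following sketch, with hypothetical function names, returns the $\ell_1$ loss and the false positive and false negative counts for one fitted coefficient vector; averaging these over replications gives the table entries.

```python
def l1_loss(beta_hat, beta_star):
    """l1 estimation loss: sum_j |beta_hat_j - beta_star_j|."""
    return sum(abs(b - t) for b, t in zip(beta_hat, beta_star))


def selection_errors(beta_hat, beta_star):
    """Return (#FP, #FN): nonzero estimates of true zeros, and
    zero estimates of true nonzeros."""
    fp = sum(1 for b, t in zip(beta_hat, beta_star) if b != 0 and t == 0)
    fn = sum(1 for b, t in zip(beta_hat, beta_star) if b == 0 and t != 0)
    return fp, fn


if __name__ == "__main__":
    beta_star = [3.0, 1.5, 0.0, 0.0, 2.0]
    beta_hat = [2.8, 0.0, 0.1, 0.0, 2.1]
    print(l1_loss(beta_hat, beta_star), selection_errors(beta_hat, beta_star))
```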



4.2 Sparse Gaussian graphical models 

We simulated $n$ independent random vectors from $N_q(0, \Sigma^*)$ with a sparse precision matrix $\Theta^* = (\Sigma^*)^{-1}$. Models 5 and 6 consider two different sparsity patterns of $\Theta^*$.

Model 5: $\Theta^*$ is a tridiagonal matrix by constructing $\Sigma^* = (\sigma^*_{ij})_{q \times q}$ as an AR(1) covariance matrix with $\sigma^*_{ij} = \exp(-|s_i - s_j|)$ for $s_1 < \cdots < s_q$, which are constructed by simulating $s_q - s_{q-1}, s_{q-1} - s_{q-2}, \ldots, s_2 - s_1$ independently from Unif(0.5, 1);

Model 6: $\Theta^* = U_{q \times q}'U_{q \times q} + I_{q \times q}$ where $U = (u_{ij})_{q \times q}$ has zero diagonals and exactly 100 nonzero off-diagonal entries. The nonzero entries are generated by $u_{ij} = t_{ij}s_{ij}$ where the $t_{ij}$'s are independently drawn from Unif(1, 2), and the $s_{ij}$'s are independent Bernoulli random variables with $\Pr(s_{ij} = 1) = \Pr(s_{ij} = -1) = 0.5$.

We also generated an independent validation set of sample size $n$ to tune each estimator. The validation error of a generic estimator $\hat\Theta$ is defined as $-\log\det(\hat\Theta) + \langle\hat\Theta, \hat\Sigma_n^{validation}\rangle$. In our simulation we let $q = 100$ and $n = 100$ & $200$.
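The claim that the Model 5 construction gives a tridiagonal precision matrix can be checked numerically on a small example. The sketch below builds $\sigma^*_{ij} = \exp(-|s_i - s_j|)$ with gaps drawn from Unif(0.5, 1), inverts it by Gauss-Jordan elimination, and reports the largest entry outside the tridiagonal band. Fixing $s_1 = 0$ is an assumption (only the gaps matter), and the helper names are hypothetical.

```python
import math
import random


def ar1_covariance(q, rng):
    """Model 5 style covariance: sigma_ij = exp(-|s_i - s_j|),
    with successive gaps s_{k+1} - s_k ~ Unif(0.5, 1) and s_1 = 0."""
    s = [0.0]
    for _ in range(q - 1):
        s.append(s[-1] + rng.uniform(0.5, 1.0))
    return [[math.exp(-abs(si - sj)) for sj in s] for si in s]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fine for small q)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def max_off_band(matrix, bandwidth=1):
    """Largest absolute entry with |i - j| > bandwidth (needs size > bandwidth + 1)."""
    n = len(matrix)
    return max(abs(matrix[i][j]) for i in range(n) for j in range(n)
               if abs(i - j) > bandwidth)


if __name__ == "__main__":
    sigma = ar1_covariance(8, random.Random(5))
    theta = invert(sigma)
    print(max_off_band(theta))  # numerically zero outside the tridiagonal band
```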

We computed the $\ell_1$ penalized Gaussian likelihood estimator, denoted by GLASSO, by using the popular R package glasso (Friedman et al. 2011). For ease of presentation, we use GSCAD/GMCP to denote the SCAD/MCP penalized Gaussian likelihood estimator. We computed the CLIME by the R package clime (Cai et al. 2012). GLASSO and CLIME were each tuned by minimizing the validation error. We considered two LLA local solutions of GSCAD/GMCP. The first one, denoted by GSCAD-lla0/GMCP-lla0, uses $\mathrm{diag}(\hat\Sigma_{jj}^{-1})$ as the initial estimator in the LLA algorithm. The second one, denoted by GSCAD-lla*/GMCP-lla*, uses the tuned CLIME as the initial estimator in the LLA algorithm. GSCAD-lla*/GMCP-lla* was designed according to the theoretical analysis in Section 3.3. In the LLA algorithm for GSCAD/GMCP we used glasso to compute the weighted $\ell_1$ penalized Gaussian likelihood estimator at each LLA step. For each local solution of GSCAD/GMCP, its penalization parameter was chosen by minimizing the validation error.



Tables 5-6 are about here.

For each model, we generated 100 independent datasets, each consisting of $n$ training samples and $n$ validation samples. Estimation accuracy is measured by the average operator norm loss $\|\hat\Theta - \Theta^*\|_{\ell_2}$ and the average Frobenius norm loss $\|\hat\Theta - \Theta^*\|_F$ over the 100 replications, and selection accuracy is evaluated by the average counts of false positives and false negatives over the 100 replications. The simulation results are summarized in Tables 5 and 6. Again, we see that the two local solutions of GSCAD/GMCP are very different, which implies that the unique minimizer argument is invalid here. GSCAD-lla* and GMCP-lla* achieve the best finite sample performance in both estimation and selection, which gives numerical evidence for the theoretical analysis in Section 3.3.

5 Technical Proofs 
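5.1 Proof of Theorem 1

Proof. To simplify notation, we let $\hat\beta^{(0)} = \hat\beta^{initial}$. Under the event $\{\|\hat\beta^{(0)} - \beta^*\|_{\max} \le a_0\lambda\}$, due to the assumption (A0), we have for $j \in \mathcal{A}^c$

$$|\hat\beta_j^{(0)}| \le \|\hat\beta^{(0)} - \beta^*\|_{\max} \le a_0\lambda \le a_2\lambda$$

and for $j \in \mathcal{A}$

$$|\hat\beta_j^{(0)}| \ge \|\beta^*_{\mathcal{A}}\|_{\min} - \|\hat\beta^{(0)} - \beta^*\|_{\max} > a\lambda.$$

Thus by property (iv) of $P_\lambda(\cdot)$, $P'_\lambda(|\hat\beta_j^{(0)}|) = 0$ for all $j \in \mathcal{A}$. Hence, $\hat\beta^{(1)}$ is the solution of the following convex optimization problem

$$\hat\beta^{(1)} = \arg\min_{\beta}\ \ell_n(\beta) + \sum_{j \in \mathcal{A}^c} P'_\lambda(|\hat\beta_j^{(0)}|) \cdot |\beta_j|. \tag{9}$$

By property (ii) & (iii) of $P_\lambda(\cdot)$, we have $P'_\lambda(|\hat\beta_j^{(0)}|) \ge a_1\lambda$ for any $j \in \mathcal{A}^c$.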
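We now show that $\hat\beta^{oracle}$ is the unique global solution to (9) under the additional condition $\{\|\nabla_{\mathcal{A}^c}\ell_n(\hat\beta^{oracle})\|_{\max} < a_1\lambda\}$, i.e. $\hat\beta^{(1)} = \hat\beta^{oracle}$. To see this, note that by convexity, we have

$$\ell_n(\beta) \ge \ell_n(\hat\beta^{oracle}) + \sum_{j} \nabla_j\ell_n(\hat\beta^{oracle})(\beta_j - \hat\beta_j^{oracle}) = \ell_n(\hat\beta^{oracle}) + \sum_{j \in \mathcal{A}^c} \nabla_j\ell_n(\hat\beta^{oracle})(\beta_j - \hat\beta_j^{oracle}), \tag{10}$$

where (3) was used in the last equality. By (10) and $\hat\beta^{oracle}_{\mathcal{A}^c} = 0$, we have that for any $\beta$

$$\Big\{\ell_n(\beta) + \sum_{j \in \mathcal{A}^c} P'_\lambda(|\hat\beta_j^{(0)}|)|\beta_j|\Big\} - \Big\{\ell_n(\hat\beta^{oracle}) + \sum_{j \in \mathcal{A}^c} P'_\lambda(|\hat\beta_j^{(0)}|)|\hat\beta_j^{oracle}|\Big\}$$

$$\ge \sum_{j \in \mathcal{A}^c} \big\{P'_\lambda(|\hat\beta_j^{(0)}|) + \nabla_j\ell_n(\hat\beta^{oracle}) \cdot \mathrm{sign}(\beta_j)\big\} \cdot |\beta_j|$$

$$\ge \sum_{j \in \mathcal{A}^c} \big\{a_1\lambda + \nabla_j\ell_n(\hat\beta^{oracle}) \cdot \mathrm{sign}(\beta_j)\big\} \cdot |\beta_j|$$

$$\ge 0.$$

The strict inequality holds unless $\beta_j = 0$, $\forall j \in \mathcal{A}^c$. This together with the uniqueness of the solution to (2) concludes that $\hat\beta^{oracle}$ is the unique solution to (9). Hence, $\hat\beta^{(1)} = \hat\beta^{oracle}$, which completes the proof of Theorem 1. □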

5.2 Proof of Theorem 2

Proof. Given that the LLA algorithm finds $\hat\beta^{oracle}$ at the current iteration, we denote $\hat\beta$ as the solution to the convex optimization problem in the next iteration of the LLA algorithm. Using $\hat\beta^{oracle}_{\mathcal{A}^c} = 0$ and $P'_\lambda(|\hat\beta_j^{oracle}|) = 0$ for $j \in \mathcal{A}$ under the event $\{\|\hat\beta^{oracle}_{\mathcal{A}}\|_{\min} > a\lambda\}$, we have

$$\hat\beta = \arg\min_{\beta}\ \ell_n(\beta) + \sum_{j \in \mathcal{A}^c} \gamma \cdot |\beta_j| \tag{11}$$

where $\gamma = P'_\lambda(0) \ge a_1\lambda$. This problem is very similar to (9). Following the same lines of the proof as in Theorem 1, it can easily be seen that under the additional condition $\{\|\nabla_{\mathcal{A}^c}\ell_n(\hat\beta^{oracle})\|_{\max} < a_1\lambda\}$, $\hat\beta^{oracle}$ is the unique solution to (11). Hence the loop within the LLA algorithm stops, which completes the proof of Theorem 2. □
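5.3 Proof of Theorem 3

Proof. Note it is sufficient to directly bound $\delta_1$ and $\delta_2$ for the least-squares problem. Let $H_{\mathcal{A}} = X_{\mathcal{A}}(X_{\mathcal{A}}'X_{\mathcal{A}})^{-1}X_{\mathcal{A}}'$. Then,

$$\nabla_{\mathcal{A}^c}\ell_n(\hat\beta^{oracle}_{LS}) = -\frac{1}{n}X_{\mathcal{A}^c}'(y - X\hat\beta^{oracle}_{LS}) = -\frac{1}{n}X_{\mathcal{A}^c}'(y - H_{\mathcal{A}}y) = -\frac{1}{n}X_{\mathcal{A}^c}'(I_{n \times n} - H_{\mathcal{A}})\varepsilon,$$

where we used $y = X_{\mathcal{A}}\beta^*_{\mathcal{A}} + \varepsilon$ in the last equality. Thus, by the union bound and the Chernoff bound, we have

$$\delta_1 = \Pr\big(\|X_{\mathcal{A}^c}'(I_{n \times n} - H_{\mathcal{A}})\varepsilon\|_{\max} \ge a_1 n\lambda\big) \le \sum_{j \in \mathcal{A}^c} \Pr\big(|x_{(j)}'(I_{n \times n} - H_{\mathcal{A}})\varepsilon| \ge a_1 n\lambda\big) \le 2\sum_{j \in \mathcal{A}^c} \exp\Big(-\frac{a_1^2 n^2\lambda^2}{2\sigma^2\|x_{(j)}'(I_{n \times n} - H_{\mathcal{A}})\|_{\ell_2}^2}\Big).$$

Using the fact that

$$\|x_{(j)}'(I_{n \times n} - H_{\mathcal{A}})\|_{\ell_2}^2 = x_{(j)}'(I_{n \times n} - H_{\mathcal{A}})x_{(j)} \le \|x_{(j)}\|_{\ell_2}^2 \le nM,$$

we conclude that

$$\delta_1 \le 2(p - s)\exp\Big(-\frac{a_1^2 n\lambda^2}{2M\sigma^2}\Big).$$

We now derive an upper bound for $\delta_2$ in the least-squares problem. Noticing that

$$\hat\beta^{oracle}_{\mathcal{A}} = (X_{\mathcal{A}}'X_{\mathcal{A}})^{-1}X_{\mathcal{A}}'y = \beta^*_{\mathcal{A}} + (X_{\mathcal{A}}'X_{\mathcal{A}})^{-1}X_{\mathcal{A}}'\varepsilon,$$

we have

$$\|\hat\beta^{oracle}_{\mathcal{A}}\|_{\min} \ge \|\beta^*_{\mathcal{A}}\|_{\min} - \|(X_{\mathcal{A}}'X_{\mathcal{A}})^{-1}X_{\mathcal{A}}'\varepsilon\|_{\max}.$$

Thus,

$$\delta_2 \le \Pr\big(\|(X_{\mathcal{A}}'X_{\mathcal{A}})^{-1}X_{\mathcal{A}}'\varepsilon\|_{\max} \ge \|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda\big). \tag{12}$$

It remains to derive an explicit probability upper bound for (12). To facilitate the notation, we define

$$(X_{\mathcal{A}}'X_{\mathcal{A}})^{-1}X_{\mathcal{A}}' = (u_1, u_2, \ldots, u_s)',$$

namely $u_j = X_{\mathcal{A}}(X_{\mathcal{A}}'X_{\mathcal{A}})^{-1}e_j$, where $e_j$ is the unit vector with $j$th element 1. It is obvious that

$$\|u_j\|_{\ell_2}^2 = e_j'(X_{\mathcal{A}}'X_{\mathcal{A}})^{-1}e_j \le (n\lambda_{\min})^{-1}.$$

By the union bound and the Markov bound again, we have

$$\Pr\big(\|(X_{\mathcal{A}}'X_{\mathcal{A}})^{-1}X_{\mathcal{A}}'\varepsilon\|_{\max} \ge \|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda\big) \le 2\sum_{j=1}^{s} \exp\Big(\frac{1}{2}\sigma^2\|u_j\|_{\ell_2}^2 t^2 - (\|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda)t\Big)$$

for any $t > 0$. By using the Chernoff bound argument, we set $t = \sigma^{-2}\|u_j\|_{\ell_2}^{-2}(\|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda)$ to obtain

$$\Pr\big(\|(X_{\mathcal{A}}'X_{\mathcal{A}})^{-1}X_{\mathcal{A}}'\varepsilon\|_{\max} \ge \|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda\big) \le 2\sum_{j=1}^{s} \exp\Big(-\frac{(\|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda)^2}{2\sigma^2\|u_j\|_{\ell_2}^2}\Big) \le 2s\exp\Big(-\frac{n\lambda_{\min}}{2\sigma^2}(\|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda)^2\Big).$$

Thus, we complete the proof of Theorem 3. □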
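To get a feel for how quickly the two least-squares bounds decay, they can be evaluated numerically. The sketch below codes the bounds $2(p-s)\exp(-a_1^2 n\lambda^2/(2M\sigma^2))$ and $2s\exp(-n\lambda_{\min}(\|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda)^2/(2\sigma^2))$ reached in this proof; the function names and the parameter values in the example are illustrative assumptions, not settings from the paper.

```python
import math


def delta1_bound(n, p, s, lam, a1, M, sigma):
    """Bound 2 (p - s) exp(-a1^2 n lam^2 / (2 M sigma^2)) on delta_1."""
    return 2.0 * (p - s) * math.exp(-(a1 ** 2) * n * lam ** 2 / (2.0 * M * sigma ** 2))


def delta2_bound(n, s, lam, a, beta_min, lam_min, sigma):
    """Bound 2 s exp(-n lam_min (beta_min - a lam)^2 / (2 sigma^2)) on delta_2.
    Requires beta_min > a * lam, the signal-strength condition."""
    gap = beta_min - a * lam
    if gap <= 0:
        raise ValueError("need beta_min > a * lam")
    return 2.0 * s * math.exp(-n * lam_min * gap ** 2 / (2.0 * sigma ** 2))


if __name__ == "__main__":
    # Illustrative values only: SCAD has a1 = 1 and a = 3.7.
    for n in (100, 200, 400):
        print(n,
              delta1_bound(n, p=500, s=3, lam=0.5, a1=1.0, M=1.0, sigma=1.0),
              delta2_bound(n, s=3, lam=0.5, a=3.7, beta_min=3.0, lam_min=0.5, sigma=1.0))
```

Both bounds decay exponentially in $n$, while the dimension $p$ enters the first bound only as a linear prefactor.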

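5.4 Proof of Theorem 4

Proof. A translation of (3) into our setting becomes

$$X_{\mathcal{A}}'\mu(\hat\beta^{oracle}_{Logit}) = X_{\mathcal{A}}'y. \tag{13}$$

We now use this to derive the upper bound for $\delta_2$. Define a map $F : \mathcal{B}(r) \subset \mathbb{R}^p \to \mathbb{R}^p$ satisfying

$$F(\Delta) = ((F_{\mathcal{A}}(\Delta_{\mathcal{A}}))', 0')'$$

with

$$F_{\mathcal{A}}(\Delta_{\mathcal{A}}) = (X_{\mathcal{A}}'\Sigma(\beta^*)X_{\mathcal{A}})^{-1} \cdot X_{\mathcal{A}}'(y - \mu(\beta^* + \Delta)) + \Delta_{\mathcal{A}}$$

and the convex compact set

$$\mathcal{B}(r) = \{\Delta \in \mathbb{R}^p : \|\Delta_{\mathcal{A}}\|_{\max} \le r,\ \Delta_{\mathcal{A}^c} = 0\}$$

with $r = 2Q_2 \cdot \|\frac{1}{n}X_{\mathcal{A}}'(\mu(\beta^*) - y)\|_{\max}$. Our aim is to show

$$F(\mathcal{B}(r)) \subset \mathcal{B}(r) \tag{14}$$

when

$$\Big\|\frac{1}{n}X_{\mathcal{A}}'(\mu(\beta^*) - y)\Big\|_{\max} \le \frac{1}{Q_1Q_2^2 s}. \tag{15}$$

If (14) holds, by the Brouwer's fixed point theorem, there always exists a fixed point $\hat\Delta \in \mathcal{B}(r)$ such that $F(\hat\Delta) = \hat\Delta$. It immediately follows that

$$X_{\mathcal{A}}'y = X_{\mathcal{A}}'\mu(\beta^* + \hat\Delta) \quad \text{and} \quad \hat\Delta_{\mathcal{A}^c} = 0,$$

which further implies that $\beta^* + \hat\Delta = \hat\beta^{oracle}_{Logit}$ by the uniqueness of the solution to (13). Thus,

$$\|\hat\beta^{oracle}_{Logit} - \beta^*\|_{\max} = \|\hat\Delta\|_{\max} \le r. \tag{16}$$

If further

$$\Big\|\frac{1}{n}X_{\mathcal{A}}'(\mu(\beta^*) - y)\Big\|_{\max} < \frac{1}{2Q_2}(\|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda),$$

then we have

$$r < \|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda$$

and by (16)

$$\|\hat\beta^{oracle}_{\mathcal{A}}\|_{\min} \ge \|\beta^*_{\mathcal{A}}\|_{\min} - \|\hat\beta^{oracle}_{\mathcal{A}} - \beta^*_{\mathcal{A}}\|_{\max} > a\lambda.$$

Therefore, we have

$$\delta_2 \le \Pr\Big(\Big\|\frac{1}{n}X_{\mathcal{A}}'(\mu(\beta^*) - y)\Big\|_{\max} \ge \min\Big\{\frac{1}{Q_1Q_2^2 s},\ \frac{1}{2Q_2}(\|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda)\Big\}\Big).$$

By combining the union bound and the Hoeffding's bound as in Proposition 4(a) of Fan and Lv (2011), we have

$$\delta_2 \le 2s \cdot \exp\Big(-\frac{n}{MQ_2^2} \cdot \min\Big\{\frac{2}{Q_1^2Q_2^2s^2},\ \frac{1}{2}(\|\beta^*_{\mathcal{A}}\|_{\min} - a\lambda)^2\Big\}\Big).$$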

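We now derive (14). By using its Taylor expansion around $\Delta = 0$,

$$X_{\mathcal{A}}'\mu(\beta^* + \Delta) = X_{\mathcal{A}}'\mu(\beta^*) + X_{\mathcal{A}}'\Sigma(\beta^*)X\Delta + R_{\mathcal{A}}(\Delta)$$

where with $\tilde\Delta$ being on the line segment joining $0$ and $\Delta$,

$$R_{\mathcal{A}}(\Delta) = X_{\mathcal{A}}'\big(\Sigma(\beta^* + \tilde\Delta) - \Sigma(\beta^*)\big)X\Delta.$$

Since $\Delta_{\mathcal{A}^c} = 0$ by the definition of $\mathcal{B}(r)$, we have $X\Delta = X_{\mathcal{A}}\Delta_{\mathcal{A}}$. By the mean-value theorem, the entrywise maximum of $R_{\mathcal{A}}(\Delta)$ can be bounded as

$$\|R_{\mathcal{A}}(\Delta)\|_{\max} \le \max_{j}\ \Delta_{\mathcal{A}}'X_{\mathcal{A}}'\,\mathrm{diag}\{|x_{(j)}| \circ |\mu''(\bar\beta)|\}\,X_{\mathcal{A}}\Delta_{\mathcal{A}}$$

for $\bar\beta$ being on the line segment joining $\beta^*$ and $\beta^* + \Delta$. Using the simple fact that $|\psi'''(t)| = \theta(t)(1 - \theta(t)) \cdot |2\theta(t) - 1| \le \frac{1}{4}$ with $\theta(t) = (1 + \exp(t))^{-1} \in (0, 1)$, we have

$$\|R_{\mathcal{A}}(\Delta)\|_{\max} \le \frac{n}{4}Q_1 \cdot \|\Delta_{\mathcal{A}}\|_{\ell_2}^2 \le \frac{n}{4}Q_1 s r^2. \tag{17}$$

Noting that

$$F_{\mathcal{A}}(\Delta_{\mathcal{A}}) = (X_{\mathcal{A}}'\Sigma(\beta^*)X_{\mathcal{A}})^{-1}X_{\mathcal{A}}'\big(y - \mu(\beta^*) + \mu(\beta^*) - \mu(\beta^* + \Delta)\big) + \Delta_{\mathcal{A}} = (X_{\mathcal{A}}'\Sigma(\beta^*)X_{\mathcal{A}})^{-1} \cdot \big(X_{\mathcal{A}}'(y - \mu(\beta^*)) - R_{\mathcal{A}}(\Delta)\big),$$

we then use the triangle inequality to obtain

$$\|F_{\mathcal{A}}(\Delta_{\mathcal{A}})\|_{\max} = \big\|(X_{\mathcal{A}}'\Sigma(\beta^*)X_{\mathcal{A}})^{-1} \cdot \big(X_{\mathcal{A}}'y - X_{\mathcal{A}}'\mu(\beta^*) - R_{\mathcal{A}}(\Delta)\big)\big\|_{\max} \le Q_2 \cdot \Big(\Big\|\frac{1}{n}X_{\mathcal{A}}'(\mu(\beta^*) - y)\Big\|_{\max} + \frac{1}{n}\|R_{\mathcal{A}}(\Delta)\|_{\max}\Big).$$

By using (17) and the definition of $r$, we have

$$\|F_{\mathcal{A}}(\Delta_{\mathcal{A}})\|_{\max} \le \frac{r}{2} + \frac{1}{4}Q_1Q_2 s r^2 \le r.$$

This establishes (14).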

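Next we prove the upper bound for $\delta_1$. Recall that $\hat\Delta = \hat\beta^{oracle}_{Logit} - \beta^*$. Let

$$\ell_n^{Logit}(\beta) = \frac{1}{n}\sum_{i=1}^{n}\{-y_i x_i'\beta + \psi(x_i'\beta)\}.$$

By a Taylor expansion,

$$\nabla\ell_n^{Logit}(\hat\beta^{oracle}_{Logit}) = \nabla\ell_n^{Logit}(\beta^*) + \nabla^2\ell_n^{Logit}(\beta^*) \cdot \hat\Delta + \big(\nabla^2\ell_n^{Logit}(\tilde\beta) - \nabla^2\ell_n^{Logit}(\beta^*)\big) \cdot \hat\Delta, \tag{18}$$

where $\tilde\beta$ is on the line segment joining $\hat\beta^{oracle}_{Logit}$ and $\beta^*$. Observe that the first and second derivatives of $\ell_n^{Logit}(\beta)$ can be explicitly written as

$$\nabla\ell_n^{Logit}(\beta) = \frac{1}{n}X'(\mu(\beta) - y) \quad \text{and} \quad \nabla^2\ell_n^{Logit}(\beta) = \frac{1}{n}X'\Sigma(\beta)X. \tag{19}$$

Now we define

$$R(\hat\Delta) = n\big(\nabla^2\ell_n^{Logit}(\tilde\beta) - \nabla^2\ell_n^{Logit}(\beta^*)\big) \cdot \hat\Delta = X'\big(\Sigma(\beta^* + \tilde\Delta) - \Sigma(\beta^*)\big)X\hat\Delta,$$

where $\tilde\Delta = \tilde\beta - \beta^*$. We also partition $R(\hat\Delta)$ with respect to $\mathcal{A}$, i.e. $R(\hat\Delta) = (R_{\mathcal{A}}'(\hat\Delta), R_{\mathcal{A}^c}'(\hat\Delta))'$. Then, using $\hat\Delta_{\mathcal{A}^c} = 0$, we have $X\hat\Delta = X_{\mathcal{A}}\hat\Delta_{\mathcal{A}}$. Substituting this into (18), we obtain

$$\nabla_{\mathcal{A}}\ell_n^{Logit}(\hat\beta^{oracle}_{Logit}) = \nabla_{\mathcal{A}}\ell_n^{Logit}(\beta^*) + \frac{1}{n}X_{\mathcal{A}}'\Sigma(\beta^*)X_{\mathcal{A}}\hat\Delta_{\mathcal{A}} + \frac{1}{n}R_{\mathcal{A}}(\hat\Delta), \tag{20}$$

and

$$\nabla_{\mathcal{A}^c}\ell_n^{Logit}(\hat\beta^{oracle}_{Logit}) = \nabla_{\mathcal{A}^c}\ell_n^{Logit}(\beta^*) + \frac{1}{n}X_{\mathcal{A}^c}'\Sigma(\beta^*)X_{\mathcal{A}}\hat\Delta_{\mathcal{A}} + \frac{1}{n}R_{\mathcal{A}^c}(\hat\Delta). \tag{21}$$

Using (19) and $\nabla_{\mathcal{A}}\ell_n^{Logit}(\hat\beta^{oracle}_{Logit}) = 0$, we can solve for $\hat\Delta_{\mathcal{A}}$ from (20) and substitute it into (21) to obtain

$$\nabla_{\mathcal{A}^c}\ell_n^{Logit}(\hat\beta^{oracle}_{Logit}) = X_{\mathcal{A}^c}'\Sigma(\beta^*)X_{\mathcal{A}}(X_{\mathcal{A}}'\Sigma(\beta^*)X_{\mathcal{A}})^{-1}\Big(-\frac{1}{n}X_{\mathcal{A}}'(\mu(\beta^*) - y) - \frac{1}{n}R_{\mathcal{A}}(\hat\Delta)\Big) + \frac{1}{n}X_{\mathcal{A}^c}'(\mu(\beta^*) - y) + \frac{1}{n}R_{\mathcal{A}^c}(\hat\Delta).$$

Recall that (16) holds under condition (15). If in addition we are under the event

$$\Big\{\|\nabla_{\mathcal{A}}\ell_n^{Logit}(\beta^*)\|_{\max} < \frac{a_1\lambda}{2(2Q_3 + 1)}\Big\} \cap \Big\{\|\nabla_{\mathcal{A}^c}\ell_n^{Logit}(\beta^*)\|_{\max} < \frac{a_1\lambda}{2}\Big\},$$

we can follow the same lines of proof as in (17) to show that

$$\|R(\hat\Delta)\|_{\max} \le \frac{n}{4}Q_1\|\hat\Delta_{\mathcal{A}}\|_{\ell_2}^2 \le \frac{n}{4}Q_1 s r^2,$$

where $r = 2Q_2 \cdot \|\nabla_{\mathcal{A}}\ell_n^{Logit}(\beta^*)\|_{\max}$. Noticing that under condition (15)

$$\frac{n}{4}Q_1 s r^2 = snQ_1Q_2^2 \cdot \|\nabla_{\mathcal{A}}\ell_n^{Logit}(\beta^*)\|_{\max}^2 \le n \cdot \|\nabla_{\mathcal{A}}\ell_n^{Logit}(\beta^*)\|_{\max},$$

under the same event we have

$$\|\nabla_{\mathcal{A}^c}\ell_n^{Logit}(\hat\beta^{oracle}_{Logit})\|_{\max} \le Q_3 \cdot \Big(\|\nabla_{\mathcal{A}}\ell_n^{Logit}(\beta^*)\|_{\max} + \frac{1}{n}\|R_{\mathcal{A}}(\hat\Delta)\|_{\max}\Big) + \|\nabla_{\mathcal{A}^c}\ell_n^{Logit}(\beta^*)\|_{\max} + \frac{1}{n}\|R_{\mathcal{A}^c}(\hat\Delta)\|_{\max}$$

$$\le (2Q_3 + 1) \cdot \|\nabla_{\mathcal{A}}\ell_n^{Logit}(\beta^*)\|_{\max} + \|\nabla_{\mathcal{A}^c}\ell_n^{Logit}(\beta^*)\|_{\max} < a_1\lambda.$$

The desired probability bound can be obtained by using Proposition 4(a) of Fan and Lv (2011) and the union bound. This completes the proof of Theorem 4. □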



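5.5 Proof of Theorem 5

Proof. By definition, it obviously holds that

$$\ell_n^{Logit}(\hat\beta^{lasso}_{Logit}) + \lambda_{lasso}\|\hat\beta^{lasso}_{Logit}\|_{\ell_1} \le \ell_n^{Logit}(\beta^*) + \lambda_{lasso}\|\beta^*\|_{\ell_1}.$$

Using the convexity of $\ell_n^{Logit}(\cdot)$, we obtain

$$(\nabla\ell_n^{Logit}(\beta^*))'(\hat\beta^{lasso}_{Logit} - \beta^*) + \lambda_{lasso}\|\hat\beta^{lasso}_{Logit}\|_{\ell_1} \le \lambda_{lasso}\|\beta^*\|_{\ell_1}.$$

This entails that on the event

$$\Big\{\Big\|\frac{1}{n}X'(y - \mu(\beta^*))\Big\|_{\max} \le \frac{1}{2}\lambda_{lasso}\Big\} \tag{22}$$

we have

$$-\frac{1}{2}\lambda_{lasso}\|\hat\beta^{lasso}_{Logit} - \beta^*\|_{\ell_1} + \lambda_{lasso}\|\hat\beta^{lasso}_{Logit}\|_{\ell_1} \le \lambda_{lasso}\|\beta^*\|_{\ell_1}$$

or

$$\frac{1}{2}\|\hat\beta^{lasso}_{Logit} - \beta^*\|_{\ell_1} \le \|\beta^*\|_{\ell_1} - \|\hat\beta^{lasso}_{Logit}\|_{\ell_1} + \|\hat\beta^{lasso}_{Logit} - \beta^*\|_{\ell_1}.$$

Using the fact that $|\beta^*_j| - |\hat\beta^{lasso}_j| + |\beta^*_j - \hat\beta^{lasso}_j| = 0$ for any $j \in \mathcal{A}^c$, we conclude that

$$\frac{1}{2}\|\hat\beta^{lasso}_{Logit} - \beta^*\|_{\ell_1} \le 2\|\hat\beta^{lasso}_{\mathcal{A}} - \beta^*_{\mathcal{A}}\|_{\ell_1}$$

where we denote $\hat\beta^{lasso}_{Logit} = (\hat\beta^{lasso}_{\mathcal{A}}, \hat\beta^{lasso}_{\mathcal{A}^c})$. The last inequality is equivalent to

$$\|\hat\beta^{lasso}_{\mathcal{A}^c}\|_{\ell_1} \le 3\|\hat\beta^{lasso}_{\mathcal{A}} - \beta^*_{\mathcal{A}}\|_{\ell_1}. \tag{23}$$

In what follows, our aim is to derive the upper bound

$$\|\hat\beta^{lasso}_{Logit} - \beta^*\|_{\ell_2} \le 5\kappa_{Logit}^{-1}s^{1/2}\lambda_{lasso}$$

under the event (22). Then the desired probability bound can be obtained by using the Hoeffding's bound as in the proof of Theorem 4.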
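Now we consider a map $F : \mathbb{R}^p \to \mathbb{R}$ satisfying

$$F(\Delta) = \ell_n^{Logit}(\beta^* + \Delta) - \ell_n^{Logit}(\beta^*) + \lambda_{lasso}(\|\beta^* + \Delta\|_{\ell_1} - \|\beta^*\|_{\ell_1}).$$

In addition, we define $\hat\Delta = \arg\min_{\Delta} F(\Delta)$. Then by definition we have $\hat\Delta = \hat\beta^{lasso}_{Logit} - \beta^*$. Since $F(0) = 0$, $F(\hat\Delta) \le F(0) = 0$. By Lemma 4 of Negahban et al. (2012), because $\|\hat\Delta_{\mathcal{A}^c}\|_{\ell_1} \le 3\|\hat\Delta_{\mathcal{A}}\|_{\ell_1}$ as in (23) and by convexity of $F(\Delta)$, it suffices to show that

$$F(\Delta) > 0$$

for any $\Delta \in \mathcal{D}$, where

$$\mathcal{D} = \{\Delta \in \mathbb{R}^p : \|\Delta_{\mathcal{A}^c}\|_{\ell_1} \le 3\|\Delta_{\mathcal{A}}\|_{\ell_1} \text{ and } \|\Delta\|_{\ell_2} = 5\kappa_{Logit}^{-1}s^{1/2}\lambda_{lasso}\}.$$

To this end, we first obtain a lower bound for $\|\beta^* + \Delta\|_{\ell_1} - \|\beta^*\|_{\ell_1}$, i.e.

$$\|\beta^* + \Delta\|_{\ell_1} - \|\beta^*\|_{\ell_1} = \|\beta^*_{\mathcal{A}} + \Delta_{\mathcal{A}}\|_{\ell_1} + \|\Delta_{\mathcal{A}^c}\|_{\ell_1} - \|\beta^*_{\mathcal{A}}\|_{\ell_1} \ge \|\Delta_{\mathcal{A}^c}\|_{\ell_1} - \|\Delta_{\mathcal{A}}\|_{\ell_1}. \tag{24}$$

Next, we derive a lower bound for $\ell_n^{Logit}(\beta^* + \Delta) - \ell_n^{Logit}(\beta^*)$. We define $G(u) = \ell_n^{Logit}(\beta^* + u\Delta)$. Recall that $\psi''(t) = \theta(t)(1 - \theta(t))$ and $\psi'''(t) = \theta(t)(1 - \theta(t))(2\theta(t) - 1)$ with $\theta(t) = (1 + \exp(t))^{-1}$. Then we have

$$G''(u) = \frac{1}{n}\sum_{i}\psi''(x_i'(\beta^* + u\Delta)) \cdot (x_i'\Delta)^2, \qquad G'''(u) = \frac{1}{n}\sum_{i}\psi'''(x_i'(\beta^* + u\Delta)) \cdot (x_i'\Delta)^3.$$

By using the simple fact that

$$0 \le |\psi'''(t)| \le \psi''(t),$$

we have

$$|G'''(u)| \le \max_{i}|x_i'\Delta| \cdot G''(u) \le m\|\Delta\|_{\ell_1} \cdot G''(u).$$

Note that by the definition of $\mathcal{D}$,

$$\|\Delta\|_{\ell_1} = \|\Delta_{\mathcal{A}}\|_{\ell_1} + \|\Delta_{\mathcal{A}^c}\|_{\ell_1} \le 4\|\Delta_{\mathcal{A}}\|_{\ell_1} \le 4s^{1/2}\|\Delta\|_{\ell_2}.$$

Let $z = 4ms^{1/2}\|\Delta\|_{\ell_2} = 20m\kappa_{Logit}^{-1}s\lambda_{lasso} > 0$. Then we have

$$|G'''(u)| \le zG''(u).$$

By Lemma 1 of Bach (2010), for any convex three times differentiable function $g(u)$ satisfying $|g'''(u)| \le Sg''(u)$ for some $S > 0$, we have

$$g(u) - g(0) - g'(0)u \ge g''(0) \cdot S^{-2}\{\exp(-uS) + uS - 1\}.$$

Here we consider $g(u) = G(u)$ and $S = z$. Let $u = 1$, and then we obtain

$$G(1) - G(0) - G'(0) \ge G''(0) \cdot h(z), \tag{25}$$

where $h(z) = z^{-2}(\exp(-z) + z - 1)$. By simple calculation it can be shown that $h(z)$ is a decreasing function in $z > 0$. Given that $z \le 1$ holds by assumption on $\lambda_{lasso}$, we have

$$h(z) \ge h(1) = \exp(-1) > 1/3.$$

By definition $G(1) = \ell_n^{Logit}(\beta^* + \Delta)$, $G(0) = \ell_n^{Logit}(\beta^*)$, $G'(0) = (\nabla\ell_n^{Logit}(\beta^*))'\Delta$ and $G''(0) = \Delta'\nabla^2\ell_n^{Logit}(\beta^*)\Delta$. Thus, we can re-write (25) as

$$\ell_n^{Logit}(\beta^* + \Delta) - \ell_n^{Logit}(\beta^*) \ge (\nabla\ell_n^{Logit}(\beta^*))'\Delta + \frac{1}{3}\Delta'\nabla^2\ell_n^{Logit}(\beta^*)\Delta. \tag{26}$$

Next, under the event $\{\|\frac{1}{n}X'(y - \mu(\beta^*))\|_{\max} \le \frac{1}{2}\lambda_{lasso}\}$, we have

$$(\nabla\ell_n^{Logit}(\beta^*))'\Delta \ge -\frac{1}{2}\lambda_{lasso}\|\Delta\|_{\ell_1}. \tag{27}$$

Now under the same event, we combine (24), (26), (27) and the restricted eigenvalue condition (C2) to obtain

$$F(\Delta) \ge \frac{1}{3}\kappa_{Logit}\|\Delta\|_{\ell_2}^2 - \frac{1}{2}\lambda_{lasso}\|\Delta\|_{\ell_1} + \lambda_{lasso}(\|\Delta_{\mathcal{A}^c}\|_{\ell_1} - \|\Delta_{\mathcal{A}}\|_{\ell_1})$$

$$\ge \frac{1}{3}\kappa_{Logit}\|\Delta\|_{\ell_2}^2 - \frac{3}{2}\lambda_{lasso}\|\Delta_{\mathcal{A}}\|_{\ell_1}$$

$$\ge \frac{1}{3}\kappa_{Logit}\|\Delta\|_{\ell_2}^2 - \frac{3}{2}\lambda_{lasso} \cdot s^{1/2}\|\Delta\|_{\ell_2}$$

$$> 0.$$

This completes the proof of Theorem 5. □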
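Two numerical facts carry the last steps of this argument: $h(z) = z^{-2}(\exp(-z) + z - 1)$ is decreasing with $h(1) = \exp(-1) > 1/3$, and on the sphere $\|\Delta\|_{\ell_2} = c\,\kappa_{Logit}^{-1}s^{1/2}\lambda_{lasso}$ the final lower bound equals $(c^2/3 - 3c/2)\,\kappa_{Logit}^{-1}s\lambda_{lasso}^2$, which is positive at $c = 5$. The sketch below checks both on a grid; it is an illustration with hypothetical function names, not a proof.

```python
import math


def h(z):
    """h(z) = (exp(-z) + z - 1) / z^2 for z > 0, as in (25)."""
    return (math.exp(-z) + z - 1.0) / (z * z)


def final_coefficient(c):
    """Coefficient c^2 / 3 - 3 c / 2 multiplying s * lam^2 / kappa when
    ||Delta||_2 = c * s^{1/2} * lam / kappa is plugged into the last bound."""
    return c * c / 3.0 - 1.5 * c


if __name__ == "__main__":
    grid = [0.1 * k for k in range(1, 51)]
    print(all(h(u) > h(v) for u, v in zip(grid, grid[1:])))  # decreasing on the grid
    print(h(1.0), math.exp(-1.0))
    print(final_coefficient(5.0))  # equals 5/6 up to rounding
```

The coefficient $c^2/3 - 3c/2$ is positive exactly when $c > 4.5$, which is why the radius constant 5 is enough.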



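5.6 Proof of Theorem 6

Proof. We first derive an upper bound for $\delta_2 = \Pr(\|\hat\Theta^{oracle}_{\mathcal{A}}\|_{\min} \le a\lambda)$. A translation of (3) into the precision matrix estimation setting becomes

$$\hat\Sigma^{oracle}_{\mathcal{A}} = \hat\Sigma^{n}_{\mathcal{A}}.$$

Let $\Sigma^{\Delta} = (\Theta^* + \Delta)^{-1}$. Define a map $F : \mathcal{B}(r) \subset \mathbb{R}^{p^2} \to \mathbb{R}^{p^2}$ such that

$$F(\mathrm{vec}(\Delta)) = ((F_{\mathcal{A}}(\mathrm{vec}(\Delta_{\mathcal{A}})))', 0')',$$

with

$$F_{\mathcal{A}}(\mathrm{vec}(\Delta_{\mathcal{A}})) = (H^*_{\mathcal{A}\mathcal{A}})^{-1} \cdot \big(\mathrm{vec}(\Sigma^{\Delta}_{\mathcal{A}}) - \mathrm{vec}(\hat\Sigma^{n}_{\mathcal{A}})\big) + \mathrm{vec}(\Delta_{\mathcal{A}}) \tag{28}$$

and the convex compact set

$$\mathcal{B}(r) = \{\Delta : \|\Delta_{\mathcal{A}}\|_{\max} \le r,\ \Delta_{\mathcal{A}^c} = 0\},$$

where $r = 2K_2 \cdot \|\hat\Sigma^{n}_{\mathcal{A}} - \Sigma^*_{\mathcal{A}}\|_{\max}$. We will show that

$$F(\mathcal{B}(r)) \subset \mathcal{B}(r) \tag{29}$$

under the condition

$$\|\hat\Sigma^{n}_{\mathcal{A}} - \Sigma^*_{\mathcal{A}}\|_{\max} \le \min\Big\{\frac{1}{6K_1K_2d},\ \frac{1}{6K_1^3K_2^2d}\Big\}. \tag{30}$$

If (29) holds, an application of the Brouwer's fixed point theorem yields that there exists a fixed point $\hat\Delta \in \mathcal{B}(r)$ satisfying

$$F_{\mathcal{A}}(\mathrm{vec}(\hat\Delta_{\mathcal{A}})) = \mathrm{vec}(\hat\Delta_{\mathcal{A}}) \quad \text{and} \quad \hat\Delta_{\mathcal{A}^c} = 0.$$

In other words, $\hat\Delta = \hat\Theta^{oracle} - \Theta^*$ by the uniqueness and thus

$$\|\hat\Theta^{oracle} - \Theta^*\|_{\max} = \|\hat\Delta\|_{\max} \le r. \tag{31}$$

We now establish (29). For any $\Delta \in \mathcal{B}(r)$, we have

$$\|\Sigma^*\Delta\|_{\ell_\infty} \le K_1 \cdot \|\Delta\|_{\ell_1} \le K_1 \cdot dr = 2K_1K_2d \cdot \|\hat\Sigma^{n}_{\mathcal{A}} - \Sigma^*_{\mathcal{A}}\|_{\max} \le \frac{1}{3},$$

by using (30). Thus,

$$J = \sum_{j=0}^{\infty}(-1)^j(\Sigma^*\Delta)^j$$

is a convergent matrix series of $\Delta$. Hence,

$$\Sigma^{\Delta} = (I + \Sigma^*\Delta)^{-1} \cdot \Sigma^* = \Sigma^* - \Sigma^*\Delta\Sigma^* + R^{\Delta}, \tag{32}$$

where $R^{\Delta} = (\Sigma^*\Delta)^2 \cdot J\Sigma^*$. Then it immediately yields that

$$\mathrm{vec}(\Sigma^{\Delta}_{\mathcal{A}}) - \mathrm{vec}(\hat\Sigma^{n}_{\mathcal{A}}) = \big(\mathrm{vec}(\Sigma^*_{\mathcal{A}}) - \mathrm{vec}(\hat\Sigma^{n}_{\mathcal{A}})\big) - \mathrm{vec}((\Sigma^*\Delta\Sigma^*)_{\mathcal{A}}) + \mathrm{vec}(R^{\Delta}_{\mathcal{A}}). \tag{33}$$

Note that

$$\mathrm{vec}(\Sigma^*\Delta\Sigma^*) = (\Sigma^* \otimes \Sigma^*) \cdot \mathrm{vec}(\Delta) = H^* \cdot \mathrm{vec}(\Delta)$$

and hence

$$\mathrm{vec}((\Sigma^*\Delta\Sigma^*)_{\mathcal{A}}) = H^*_{\mathcal{A}\mathcal{A}} \cdot \mathrm{vec}(\Delta_{\mathcal{A}}).$$

Now we follow the same lines of the proof as in Lemma 5 of Ravikumar et al. (2008) to obtain

$$\|R^{\Delta}\|_{\max} = \max_{i,j}\big|e_i'\big((\Sigma^*\Delta)^2 \cdot J\Sigma^*\big)e_j\big| \le \frac{3}{2}K_1^3 \cdot d\|\Delta\|_{\max}^2. \tag{34}$$

Therefore, a combination of (28), (33) and (34) yields the following upper bound,

$$\|F_{\mathcal{A}}(\mathrm{vec}(\Delta_{\mathcal{A}}))\|_{\max} = \big\|(H^*_{\mathcal{A}\mathcal{A}})^{-1} \cdot \big((\mathrm{vec}(\Sigma^*_{\mathcal{A}}) - \mathrm{vec}(\hat\Sigma^{n}_{\mathcal{A}})) + \mathrm{vec}(R^{\Delta}_{\mathcal{A}})\big)\big\|_{\max} \le K_2 \cdot \big(\|\Sigma^*_{\mathcal{A}} - \hat\Sigma^{n}_{\mathcal{A}}\|_{\max} + \|R^{\Delta}\|_{\max}\big) \le r.$$

This proves (29).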

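Under the additional condition

$$\|\hat\Sigma^{n}_{\mathcal{A}} - \Sigma^*_{\mathcal{A}}\|_{\max} < \frac{1}{2K_2}(\|\Theta^*_{\mathcal{A}}\|_{\min} - a\lambda),$$

by (31) and the definition of $r$, we have that

$$\|\hat\Theta^{oracle}_{\mathcal{A}}\|_{\min} \ge \|\Theta^*_{\mathcal{A}}\|_{\min} - \|\hat\Theta^{oracle} - \Theta^*\|_{\max} \ge \|\Theta^*_{\mathcal{A}}\|_{\min} - 2K_2 \cdot \|\hat\Sigma^{n}_{\mathcal{A}} - \Sigma^*_{\mathcal{A}}\|_{\max} > a\lambda.$$

Thus,

$$\delta_2 \le \Pr\Big(\|\hat\Sigma^{n}_{\mathcal{A}} - \Sigma^*_{\mathcal{A}}\|_{\max} \ge \frac{1}{2K_2}\min\Big\{\frac{1}{3K_1d},\ \frac{1}{3K_1^3K_2d},\ \|\Theta^*_{\mathcal{A}}\|_{\min} - a\lambda\Big\}\Big).$$

An application of (8) yields the bound on $\delta_2$.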

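We now deal with $\delta_1 = \Pr(\|\nabla_{\mathcal{A}^c}\ell_n(\hat\Theta^{oracle})\|_{\max} \ge a_1\lambda)$. Note that

$$\nabla_{\mathcal{A}^c}\ell_n(\hat\Theta^{oracle}) = \hat\Sigma^{n}_{\mathcal{A}^c} - \hat\Sigma^{oracle}_{\mathcal{A}^c},$$

and hence

$$\|\nabla_{\mathcal{A}^c}\ell_n(\hat\Theta^{oracle})\|_{\max} \le \|\hat\Sigma^{n}_{\mathcal{A}^c} - \Sigma^*_{\mathcal{A}^c}\|_{\max} + \|\Sigma^*_{\mathcal{A}^c} - \hat\Sigma^{oracle}_{\mathcal{A}^c}\|_{\max}. \tag{35}$$

Note $\|\hat\Sigma^{n}_{\mathcal{A}^c} - \Sigma^*_{\mathcal{A}^c}\|_{\max}$ is bounded by using (8). Then we only need to bound $\|\hat\Sigma^{oracle}_{\mathcal{A}^c} - \Sigma^*_{\mathcal{A}^c}\|_{\max}$. Recall $\hat\Delta = \hat\Theta^{oracle} - \Theta^*$. By (32), we have

$$\hat\Sigma^{oracle} = (\Theta^* + \hat\Delta)^{-1} = (I + \Sigma^*\hat\Delta)^{-1} \cdot \Sigma^* = \Sigma^* - \Sigma^*\hat\Delta\Sigma^* + \hat R, \tag{36}$$

where $\hat R = (\Sigma^*\hat\Delta)^2 \cdot \hat J\Sigma^*$ and $\hat J$ is defined similarly to $J$ with $\Delta$ replaced by $\hat\Delta$. Then $\hat J$ is a convergent matrix series under condition (30). In terms of $\mathcal{A}$, we can equivalently write (36) as

$$\mathrm{vec}(\hat\Sigma^{oracle}_{\mathcal{A}}) - \mathrm{vec}(\Sigma^*_{\mathcal{A}}) = -H^*_{\mathcal{A}\mathcal{A}} \cdot \mathrm{vec}(\hat\Delta_{\mathcal{A}}) + \mathrm{vec}(\hat R_{\mathcal{A}})$$

$$\mathrm{vec}(\hat\Sigma^{oracle}_{\mathcal{A}^c}) - \mathrm{vec}(\Sigma^*_{\mathcal{A}^c}) = -H^*_{\mathcal{A}^c\mathcal{A}} \cdot \mathrm{vec}(\hat\Delta_{\mathcal{A}}) + \mathrm{vec}(\hat R_{\mathcal{A}^c})$$

where we use the fact that $\hat\Delta_{\mathcal{A}^c} = 0$. Solving $\mathrm{vec}(\hat\Delta_{\mathcal{A}})$ from the first equation and substituting it into the second equation, we obtain

$$\mathrm{vec}(\hat\Sigma^{oracle}_{\mathcal{A}^c}) - \mathrm{vec}(\Sigma^*_{\mathcal{A}^c}) = H^*_{\mathcal{A}^c\mathcal{A}}(H^*_{\mathcal{A}\mathcal{A}})^{-1} \cdot \big[\mathrm{vec}(\hat\Sigma^{oracle}_{\mathcal{A}}) - \mathrm{vec}(\Sigma^*_{\mathcal{A}}) - \mathrm{vec}(\hat R_{\mathcal{A}})\big] + \mathrm{vec}(\hat R_{\mathcal{A}^c}).$$

Recall (34) holds under condition (30). Thus, we have

$$\|\hat R\|_{\max} \le \frac{3}{2}K_1^3 \cdot d\|\hat\Delta\|_{\max}^2 \le 6K_1^3K_2^2d \cdot \|\hat\Sigma^{n}_{\mathcal{A}} - \Sigma^*_{\mathcal{A}}\|_{\max}^2 \le \|\hat\Sigma^{n}_{\mathcal{A}} - \Sigma^*_{\mathcal{A}}\|_{\max}.$$

Thus under the additional event

$$\Big\{\|\hat\Sigma^{n}_{\mathcal{A}^c} - \Sigma^*_{\mathcal{A}^c}\|_{\max} < \frac{a_1\lambda}{2}\Big\} \cap \Big\{\|\hat\Sigma^{n}_{\mathcal{A}} - \Sigma^*_{\mathcal{A}}\|_{\max} \le \frac{a_1\lambda}{2(2K_3 + 1)}\Big\}$$

we derive the desired upper bound for (35) by using the triangular inequality,

$$\|\nabla_{\mathcal{A}^c}\ell_n(\hat\Theta^{oracle})\|_{\max} \le \|\hat\Sigma^{n}_{\mathcal{A}^c} - \Sigma^*_{\mathcal{A}^c}\|_{\max} + \|\hat\Sigma^{oracle}_{\mathcal{A}^c} - \Sigma^*_{\mathcal{A}^c}\|_{\max} < \frac{a_1\lambda}{2} + (2K_3 + 1) \cdot \|\hat\Sigma^{n}_{\mathcal{A}} - \Sigma^*_{\mathcal{A}}\|_{\max} \le a_1\lambda.$$

Therefore,

$$\delta_1 \le \Pr\Big(\|\hat\Sigma^{n}_{\mathcal{A}} - \Sigma^*_{\mathcal{A}}\|_{\max} \ge \min\Big\{\frac{1}{6K_1K_2d},\ \frac{1}{6K_1^3K_2^2d},\ \frac{a_1\lambda}{2(2K_3 + 1)}\Big\}\Big) + \Pr\Big(\|\hat\Sigma^{n}_{\mathcal{A}^c} - \Sigma^*_{\mathcal{A}^c}\|_{\max} \ge \frac{a_1\lambda}{2}\Big).$$

An application of (8) yields the bound on $\delta_1$. This completes the proof of Theorem 6. □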
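The identity $\mathrm{vec}(\Sigma^*\Delta\Sigma^*) = (\Sigma^* \otimes \Sigma^*)\,\mathrm{vec}(\Delta)$ used in this proof to bring in the Hessian $H^* = \Sigma^* \otimes \Sigma^*$ is easy to confirm numerically. The sketch below does so for a small symmetric example; the matrices are arbitrary illustrative values and the helper names are hypothetical.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def kron(A, B):
    """Kronecker product A (x) B as a list of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def vec(A):
    """Column-stacking vectorization of a matrix."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]


def matvec(A, v):
    """Matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


if __name__ == "__main__":
    sigma = [[2.0, 0.5], [0.5, 1.0]]    # symmetric, illustrative
    delta = [[0.1, -0.2], [-0.2, 0.3]]  # symmetric perturbation
    lhs = vec(matmul(matmul(sigma, delta), sigma))
    rhs = matvec(kron(sigma, sigma), vec(delta))
    print(max(abs(u - v) for u, v in zip(lhs, rhs)))  # zero up to rounding
```

The general rule is $\mathrm{vec}(AXB) = (B' \otimes A)\,\mathrm{vec}(X)$; symmetry of $\Sigma^*$ reduces it to the form used above.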
Table 1: Numerical comparison of LASSO, SCAD & MCP for the sparse linear regression problem in Model 1. Estimation performance is measured by the ℓ1 loss, and selection accuracy is measured by counts of false negative (#FN) or false positive (#FP). Each metric is averaged over 100 replications with its standard error shown in parentheses.

              n = 100 & p = 500                              n = 100 & p = 1000
Method        ℓ1 loss         # FP            # FN           ℓ1 loss         # FP            # FN
LASSO         1.040 (0.038)   11.36 (0.63)    0 (0)          1.204 (0.045)   14.68 (0.74)    0 (0)
SCAD-cd       0.333 (0.018)   1.69 (0.33)     0 (0)          0.339 (0.017)   2.22 (0.40)     0 (0)
SCAD-lla0     0.268 (0.012)   0 (0)           0 (0)          0.293 (0.014)   0 (0)           0 (0)
SCAD-lla*     0.267 (0.012)   0 (0)           0 (0)          0.291 (0.014)   0 (0)           0 (0)
MCP-cd        0.333 (0.018)   0.77 (0.16)     0 (0)          0.314 (0.015)   0.75 (0.14)     0 (0)
MCP-lla0      0.290 (0.015)   0 (0)           0 (0)          0.295 (0.016)   0 (0)           0 (0)
MCP-lla*      0.288 (0.014)   0 (0)           0 (0)          0.290 (0.015)   0 (0)           0 (0)
Table 2: Numerical comparison of LASSO, SCAD & MCP for the sparse linear regression problem in Model 2. Estimation performance is measured by the ℓ1 loss, and selection accuracy is measured by counts of false negative (#FN) or false positive (#FP). Each metric is averaged over 100 replications with its standard error shown in parentheses.

              n = 100 & p = 500                              n = 100 & p = 1000
Method        ℓ1 loss         # FP            # FN           ℓ1 loss         # FP            # FN
LASSO         4.844 (0.135)   40.28 (1.06)    0 (0)          6.829 (0.171)   53.27 (1.23)    0.03 (0.01)
SCAD-cd       1.227 (0.036)   8.78 (0.59)     0 (0)          1.288 (0.042)   11.25 (0.64)    0 (0)
SCAD-lla0     0.914 (0.033)   0 (0)           0.04 (0.02)    1.093 (0.074)   0.15 (0.06)     0.15 (0.04)
SCAD-lla*     0.903 (0.031)   0 (0)           0.03 (0.01)    1.064 (0.063)   0.10 (0.04)     0.15 (0.04)
MCP-cd        0.948 (0.028)   1.16 (0.17)     0 (0)          1.149 (0.131)   1.29 (0.18)     0.14 (0.08)
MCP-lla0      0.941 (0.033)   0.17 (0.06)     0.01 (0.01)    1.052 (0.067)   0.28 (0.10)     0.07 (0.03)
MCP-lla*      0.928 (0.033)   0.13 (0.05)     0.01 (0.01)    1.031 (0.064)   0.23 (0.09)     0.07 (0.03)
Table 3: Numerical comparison of LASSO, SCAD & MCP for the sparse logistic regression problem in Model 3. Estimation performance is measured by the ℓ1 loss, and selection accuracy is measured by counts of false negative (#FN) or false positive (#FP). Each metric is averaged over 100 replications with its standard error shown in parentheses.

              n = 200 & p = 500                              n = 200 & p = 1000
Method        ℓ1 loss         # FP            # FN           ℓ1 loss         # FP            # FN
LASSO         5.274 (0.047)   20.30 (0.39)    0.01 (0.01)    5.670 (0.049)   24.02 (0.44)    0.04 (0.01)
SCAD-cd       4.086 (0.054)   10.79 (0.25)    0.04 (0.01)    4.496 (0.056)   13.99 (0.31)    0.08 (0.01)
SCAD-lla0     1.851 (0.092)   0.31 (0.04)     0.09 (0.02)    2.159 (0.108)   0.31 (0.05)     0.22 (0.02)
SCAD-lla*     1.822 (0.090)   0.24 (0.04)     0.10 (0.02)    2.080 (0.103)   0.26 (0.04)     0.19 (0.02)
MCP-cd        2.671 (0.056)   2.23 (0.09)     0.27 (0.03)    2.936 (0.074)   2.64 (0.11)     0.47 (0.03)
MCP-lla0      1.880 (0.093)   0.30 (0.04)     0.12 (0.02)    2.159 (0.108)   0.45 (0.06)     0.19 (0.02)
MCP-lla*      1.848 (0.089)   0.26 (0.04)     0.09 (0.02)    2.146 (0.097)   0.35 (0.05)     0.23 (0.03)
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Table 4: Numerical comparison of LASSO, SCAD & MCP for the sparse logistic regression 
problem in Model 4. Estimation performance is measured by the i\ loss, and selection 
accuracy is measured by counts of false negative (#FN) or false positive (#FP). Each metric 
is averaged over 100 replications with its standard error shown in the parenthesis. 





                   n = 200 & p = 500                 n = 200 & p = 1000
Method        $\ell_1$ loss   # FP     # FN      $\ell_1$ loss   # FP     # FN

LASSO            13.909       49.56    0.22         15.079       55.92    0.59
                 (0.053)      (0.62)   (0.03)       (0.061)      (0.93)   (0.04)
SCAD-cd           7.906       20.30    0.42          9.123       27.72    0.58
                 (0.129)      (0.41)   (0.04)       (0.147)      (0.46)   (0.04)
SCAD-lla0         5.612        0.90    1.50          6.416        0.79    2.73
                 (0.159)      (0.08)   (0.07)       (0.129)      (0.06)   (0.08)
SCAD-lla*         5.209        0.44    1.54          6.413        0.74    2.74
                 (0.128)      (0.05)   (0.07)       (0.143)      (0.06)   (0.09)
MCP-cd            6.227        3.10    1.38          6.973        3.62    1.46
                 (0.121)      (0.14)   (0.04)       (0.160)      (0.14)   (0.08)
MCP-lla0          6.168        1.18    1.44          6.884        1.11    2.81
                 (0.168)      (0.08)   (0.07)       (0.141)      (0.09)   (0.09)
MCP-lla*          5.854        0.86    1.46          6.300        0.78    2.64
                 (0.267)      (0.07)   (0.07)       (0.135)      (0.07)   (0.08)

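The metrics reported in Table 4 can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the function names, the tolerance argument `tol`, and the toy coefficient vectors are hypothetical and are not taken from the paper, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of replications.

```python
import math
import statistics


def l1_loss(beta_hat, beta_true):
    """Sum of absolute coordinate-wise estimation errors."""
    return sum(abs(b_hat - b) for b_hat, b in zip(beta_hat, beta_true))


def false_positives(beta_hat, beta_true, tol=0.0):
    """Count coefficients estimated as nonzero whose true value is zero."""
    return sum(1 for b_hat, b in zip(beta_hat, beta_true)
               if abs(b_hat) > tol and b == 0)


def false_negatives(beta_hat, beta_true, tol=0.0):
    """Count coefficients estimated as zero whose true value is nonzero."""
    return sum(1 for b_hat, b in zip(beta_hat, beta_true)
               if abs(b_hat) <= tol and b != 0)


def mean_and_se(values):
    """Average over replications and its standard error (assumed sd / sqrt(R))."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


# Toy illustration with made-up numbers (not the paper's simulation data).
beta_true = [3.0, 1.5, 0.0, 0.0, 2.0]
estimates = [
    [2.8, 1.4, 0.0, 0.3, 0.0],  # replication 1
    [3.1, 0.0, 0.0, 0.0, 1.9],  # replication 2
]
losses = [l1_loss(b, beta_true) for b in estimates]
fp_counts = [false_positives(b, beta_true) for b in estimates]
fn_counts = [false_negatives(b, beta_true) for b in estimates]
print(mean_and_se(losses), mean_and_se(fp_counts), mean_and_se(fn_counts))
```

In a simulation study, each replication would contribute one entry to `losses`, `fp_counts` and `fn_counts`, and `mean_and_se` would then give one table cell together with the value in parentheses beneath it.
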
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Table 5: Numerical comparison of GLASSO, CLIME, GSCAD & GMCP for the sparse 
precision matrix estimation problem in Model 5. Estimation performance is measured by 
the Operator norm and the Frobenius norm, and selection accuracy is measured by counts 
of false negative (#FN) or false positive (#FP). 





Method          Operator norm   Frobenius norm      # FP          # FN

                              n = 100 & p = 100

GLASSO              1.452           6.115          743.56         1.34
                   (0.009)         (0.022)         (10.75)       (0.17)
CLIME               1.401           5.885          741.16         2.42
                 [illegible]       (0.029)         (12.80)     [illegible]
GSCAD-lla0          1.163           4.420          641.82         1.96
                 [illegible]       (0.029)          (9.41)     [illegible]
GSCAD-lla*          1.162           4.416          635.49         1.94
                 [illegible]       (0.029)          (9.39)     [illegible]
GMCP-lla0           1.527           4.556          291.04         6.45
                 [illegible]       (0.042)          (5.12)     [illegible]
GMCP-lla*           1.391           4.310          229.87         6.29
                 [illegible]       (0.037)          (4.56)     [illegible]

                              n = 200 & p = 100

GLASSO              1.270           5.424          366.44         0.04
                   (0.005)         (0.013)          (3.44)       (0.02)
CLIME               0.962           3.923          390.34         0.06
                   (0.007)         (0.016)          (4.35)       (0.02)
GSCAD-lla0          0.772           2.793          285.15         0.26
                   (0.010)         (0.013)          (3.05)       (0.02)
GSCAD-lla*          0.746           2.514          285.13         0.06
                   (0.007)         (0.012)          (3.05)       (0.02)
GMCP-lla0           0.755           2.517          180.85         0.32
                   (0.009)         (0.015)          (3.17)       (0.05)
GMCP-lla*           0.725           2.468          152.06         0.38
                   (0.008)         (0.011)          (2.31)       (0.05)

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Table 6: Numerical comparison of GLASSO, CLIME, GSCAD & GMCP for the sparse 
precision matrix estimation problem in Model 6. Estimation performance is measured by 
the Operator norm and the Frobenius norm, and selection accuracy is measured by counts 
of false negative (#FN) or false positive (#FP). 



Method          Operator norm   Frobenius norm      # FP          # FN

                              n = 100 & p = 100

GLASSO             11.631          95.447          936.76        56.16
                   (0.015)         (0.032)          (5.19)       (0.52)
CLIME               8.558          18.404          393.04        19.96
                   (0.053)         (0.075)          (7.22)       (0.38)
GSCAD-lla0         10.727          20.683          228.70        54.54
                   (0.048)         (0.121)          (4.92)       (0.58)
GSCAD-lla*          6.416          13.363          196.60        30.02
                   (0.126)         (0.153)          (5.27)       (0.57)
GMCP-lla0          10.337          19.202          200.37        52.24
                   (0.071)         (0.120)          (4.26)       (0.60)
GMCP-lla*           5.977          12.741           44.79        25.18
                   (0.167)         (0.169)          (3.42)       (0.61)

                              n = 200 & p = 100

GLASSO             11.492          94.857           78.18        49.74
                   (0.009)         (0.021)          (1.51)       (0.33)
CLIME            [illegible]     [illegible]     [illegible]   [illegible]
                   (0.056)         (0.061)          (5.21)       (0.12)
GSCAD-lla0         10.411          18.820           76.16        47.62
                   (0.034)         (0.092)          (1.47)       (0.34)
GSCAD-lla*          3.739           7.633           67.00         9.58
                   (0.059)         (0.074)          (1.65)       (0.28)
GMCP-lla0           9.937          16.813           75.61        45.08
                   (0.053)         (0.076)          (1.41)       (0.33)
GMCP-lla*           3.406           7.160           14.64         6.36
                   (0.054)         (0.070)          (1.20)       (0.27)

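As a rough illustration of the matrix metrics used in Tables 5 and 6, the sketch below computes the Frobenius norm, an operator norm, and support-recovery counts for a precision matrix estimate. Several choices here are assumptions rather than details confirmed by the text: the operator norm is taken to be the spectral norm (largest singular value) and is approximated by power iteration, false positives and false negatives are counted over ordered off-diagonal entries, and all function names and the toy matrices are hypothetical.

```python
import math


def difference(a, b):
    """Entry-wise difference a - b of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def frobenius_norm(a):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in a for x in row))


def spectral_norm(a, iters=200):
    """Largest singular value of a, via power iteration on a^T a."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # assumed generic starting vector
    for _ in range(iters):
        av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:  # zero matrix, or start vector in the null space
            return 0.0
        v = [x / norm_w for x in w]
    av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in av))


def support_errors(theta_hat, theta_true):
    """Return (#FP, #FN) over ordered off-diagonal entries (an assumption)."""
    fp = fn = 0
    p = len(theta_true)
    for i in range(p):
        for j in range(p):
            if i == j:
                continue
            est_nonzero = theta_hat[i][j] != 0
            true_nonzero = theta_true[i][j] != 0
            if est_nonzero and not true_nonzero:
                fp += 1
            elif true_nonzero and not est_nonzero:
                fn += 1
    return fp, fn


# Toy illustration with made-up matrices (not the paper's models).
theta_true = [[2.0, 0.5, 0.0], [0.5, 2.0, 0.0], [0.0, 0.0, 2.0]]
theta_hat = [[2.0, 0.0, 0.1], [0.0, 2.0, 0.0], [0.1, 0.0, 2.0]]
err = difference(theta_hat, theta_true)
print(spectral_norm(err), frobenius_norm(err), support_errors(theta_hat, theta_true))
```

Power iteration is used here only to keep the example free of external dependencies; a numerical library's singular value routine would normally be the more robust choice.
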
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